musl - Re: [PATCH v2] IDNA support in name lookups

Message-ID: <20170423081424.GA15554@wirbelwind>
Date: Sun, 23 Apr 2017 10:14:24 +0200
From: Joakim Sindholt <opensource@...sha.com>
To: musl@...ts.openwall.com
Subject: Re: [PATCH v2] IDNA support in name lookups

On Sat, Apr 22, 2017 at 09:01:00PM -0400, Rich Felker wrote:
> On Sun, Apr 02, 2017 at 09:30:26AM +0200, Joakim Sindholt wrote:
> > Changes since v1:
> > * Reject UTF-16 surrogate range runes
> > * Remove locale override
> > 
> > This is from some discussion on IRC and while I agree that it's more
> > "correct" in POSIX terms, I'm not particularly happy about having to
> > explicitly enable UTF-8 support with setlocale.
> 
> Yes, I'm not really happy about the decision on the C locale either,
> but I understand the reasons various parties wanted it that way and I
> think trying to follow the closest-to-working consensus process we
> have is better than just following it when we agree with the outcomes.

Understood. I agree that we need to be consistent.
 
> > There might still be bugs and character ranges that need to be rejected.
> 
> As far as I can tell, no normalization is done. This might be
> problematic for strings where the natural way users would type it does
> not match the normalized form required in IDN's, but it would also be
> expensive to handle. I think it's okay to punt on this until it proves
> to actually be a problem.

It does ASCII normalization (tolower) but not Unicode normalization. I
did a quick search for it and came up with RFC 5892[1], which specifies a
rather intricate set of characters to be allowed.
Maybe I'll go through it one day, but it will most certainly be in
another patch.
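
A minimal sketch of the gap between the two kinds of normalization, assuming
hypothetical helpers `ascii_lower` and `same_after_ascii_fold` that are not
part of the patch. ASCII folding reconciles case differences, but the
precomposed (NFC) and decomposed (NFD) spellings of "é" are different byte
sequences and still compare unequal afterwards:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: lowercase ASCII bytes only. Every byte >= 0x80,
 * i.e. every part of a UTF-8 multibyte sequence, is left untouched. */
void ascii_lower(char *s)
{
	for (; *s; s++)
		if ((unsigned char)*s < 0x80)
			*s = tolower((unsigned char)*s);
}

/* Returns 1 if the two names are byte-identical after ASCII folding,
 * 0 otherwise (including when either name does not fit the buffers). */
int same_after_ascii_fold(const char *a, const char *b)
{
	char x[256], y[256];

	if (strlen(a) >= sizeof x || strlen(b) >= sizeof y)
		return 0;
	strcpy(x, a);
	ascii_lower(x);
	strcpy(y, b);
	ascii_lower(y);
	return strcmp(x, y) == 0;
}
```

With these helpers, `same_after_ascii_fold("EXAMPLE.com", "example.COM")`
returns 1, while comparing `"caf\xc3\xa9"` (U+00E9) against `"cafe\xcc\x81"`
(U+0065 U+0301) returns 0 even though both render as "café".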

> > From 54d5caf36cdce4e5008aecfcc2b02580fb52d0cb Mon Sep 17 00:00:00 2001
> > From: Joakim Sindholt <opensource@...sha.com>
> > Date: Wed, 29 Mar 2017 11:51:02 +0200
> > Subject: [PATCH] add IDNA support to name lookups
> > 
> > ---
> >  src/network/lookup_name.c | 202 +++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 193 insertions(+), 9 deletions(-)
> > 
> > diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
> > index fb7303a..fd4275c 100644
> > --- a/src/network/lookup_name.c
> > +++ b/src/network/lookup_name.c
> 
> I think I'd rather put these functions in their own source file, just
> because they're logically distinct, even if this is the only caller.

No objections.

> > @@ -10,9 +10,21 @@
> >  #include <unistd.h>
> >  #include <pthread.h>
> >  #include <errno.h>
> > +#include <wchar.h>
> >  #include "lookup.h"
> >  #include "stdio_impl.h"
> >  #include "syscall.h"
> > +#include "locale_impl.h"
> 
> Is locale_impl.h actually used now?

No, I just forgot to remove it.

[ snip ]
> > @@ -61,12 +230,25 @@ static int name_from_hosts(struct address buf[static MAXADDRS], char canon[stati
> >  		return EAI_SYSTEM;
> >  	}
> >  	while (fgets(line, sizeof line, f) && cnt < MAXADDRS) {
> > -		char *p, *z;
> > +		char idna[256];
> > +		ssize_t r;
> > +		char *p, *z, c;
> >  
> >  		if ((p=strchr(line, '#'))) *p++='\n', *p=0;
> > -		for(p=line+1; (p=strstr(p, name)) &&
> > -			(!isspace(p[-1]) || !isspace(p[l])); p++);
> > -		if (!p) continue;
> > +		/* skip ip address and canonicalize names */
> > +		for (p=line; *p && !isspace(*p); p++);
> > +		while (*p) {
> > +			for (; *p && isspace(*p); p++);
> > +			for (z=p; *z && !isspace(*z); z++);
> > +			c = *z;
> > +			*z = 0;
> > +			r = idnaenc(idna, p);
> > +			*z = c;
> > +			if (r == l && memcmp(idna, name, l) == 0)
> > +				break;
> > +			p = z;
> > +		}
> > +		if (!*p) continue;
> >  
> >  		/* Isolate IP address to parse */
> >  		for (p=line; *p && !isspace(*p); p++);
> > @@ -86,7 +268,7 @@ static int name_from_hosts(struct address buf[static MAXADDRS], char canon[stati
> >  		for (; *p && isspace(*p); p++);
> >  		for (z=p; *z && !isspace(*z); z++);
> >  		*z = 0;
> > -		if (is_valid_hostname(p)) memcpy(canon, p, z-p+1);
> > +		if ((r = idnaenc(idna, p)) > 0) memcpy(canon, idna, r);
> >  	}
> >  	__fclose_ca(f);
> >  	return cnt ? cnt : badfam;
> > @@ -285,15 +467,17 @@ static int addrcmp(const void *_a, const void *_b)
> 
> Is there any reason this needs to be done, or should be done, for
> lookups from the hosts file? IDN/punycode is a hack for transporting
> unicode names on top of DNS protocol. For hosts file you can just put
> the proper unicode strings directly in the file.

My logic was that some people might have/want to have punycode in their
hosts file, and some might even (accidentally or otherwise) have mixed
punycode-unicode names written down. In any case I wanted it to Just
Work™, so converting every name in the file to punycode before comparing
seemed to be the easiest way to ensure it catches everything.
This was prompted by a paper wedding invitation I received where the
couple had listed their gift registry in punycode form. This says to me
that people just don't know or care about this, and since the hosts file
is used extensively by non-developers as well, I would personally prefer
that this worked regardless of what deranged things people might put in
there.
In fact I could imagine a lot of people shoving the punycode form in
there under the assumption that it would work better.
Also, suppose you have callers from all over the system dialing out to a
server but some of them call xn--foo-bar.com and others dial the unicode
version. Is it really reasonable that you should need to list this
domain twice in the hosts file for it to work?
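
The matching strategy argued for here can be sketched as follows. This is an
illustration under assumptions, not the patch's code: `hosts_line_matches`,
`canon_fn` and `canon_ascii` are hypothetical names, and `canon_ascii` is only
a stand-in that lowercases ASCII. Plugging in a real IDNA encoder as the
canonicalizer is what would make the Unicode and xn-- spellings of a name
collapse to the same bytes, so one hosts entry serves both kinds of caller.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Canonicalizer signature (assumption): writes a NUL-terminated canonical
 * form of src into dst and returns its length, or -1 on failure. */
typedef long (*canon_fn)(char *dst, size_t cap, const char *src);

/* Stand-in canonicalizer: ASCII-lowercase only, no punycode. */
long canon_ascii(char *dst, size_t cap, const char *src)
{
	size_t n = strlen(src);

	if (!n || n >= cap)
		return -1;
	for (size_t i = 0; i <= n; i++)	/* <= n also copies the NUL */
		dst[i] = (unsigned char)src[i] < 0x80
			? tolower((unsigned char)src[i]) : src[i];
	return n;
}

/* Returns 1 if any name on a hosts line canonicalizes to `want`, which
 * must already be in canonical form. Text after '#' is ignored. */
int hosts_line_matches(const char *line, const char *want, canon_fn canon)
{
	char tok[256], enc[256];
	size_t wl = strlen(want);
	const char *p = line, *z;

	/* skip the address field */
	while (*p && !isspace((unsigned char)*p) && *p != '#')
		p++;
	for (;;) {
		while (isspace((unsigned char)*p))
			p++;
		if (!*p || *p == '#')
			return 0;
		for (z = p; *z && !isspace((unsigned char)*z) && *z != '#'; z++);
		if ((size_t)(z - p) < sizeof tok) {
			memcpy(tok, p, z - p);
			tok[z - p] = 0;
			long r = canon(enc, sizeof enc, tok);
			if (r >= 0 && (size_t)r == wl && !memcmp(enc, want, wl))
				return 1;
		}
		p = z;
	}
}
```

Comparing whole canonicalized tokens also avoids the substring pitfalls of a
strstr-based scan: "example.community" does not match a query for
"example.com".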

> >  int __lookup_name(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, int flags)
> >  {
> > +	char _name[256];
> >  	int cnt = 0, i, j;
> >  
> >  	*canon = 0;
> >  	if (name) {
> > -		/* reject empty name and check len so it fits into temp bufs */
> > -		size_t l = strnlen(name, 255);
> > -		if (l-1 >= 254)
> > +		/* convert unicode name to RFC3492 punycode */
> > +		ssize_t l;
> > +		if ((l = idnaenc(_name, name)) <= 0)
> >  			return EAI_NONAME;
> > -		memcpy(canon, name, l+1);
> > +		memcpy(canon, _name, l+1);
> > +		name = _name;
> >  	}
> 
> If it's not needed for hosts backend, this code probably belongs
> localized to the dns lookup, rather than at the top of __lookup_name.
> 
> BTW there's perhaps also a need for the opposite-direction
> translation, both for ai_canonname (when a CNAME points to IDN) and
> for getnameinfo reverse lookups. But that can be added as a second
> patch I think.

I have already written the code for decoding as well if need be :)

The only problem as I see it is that a unicode name can be a hair under
4 times larger (in bytes) than the punycode equivalent. Pick any 4-byte
UTF-8 character and build labels consisting exclusively of it. Every
character after the first will be encoded as a single 'a'.
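
The size claim can be checked with a small encoder. The following is a
minimal sketch of the RFC 3492 encoding procedure, not the patch's `idnaenc`:
the names `punycode_encode` and `utf8_len` are hypothetical, the "xn--" prefix
is not added, and overflow and code point validity (for example surrogates)
are deliberately left unchecked.

```c
#include <stddef.h>
#include <stdint.h>

enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 128 };

/* Map a digit value 0..35 to 'a'..'z', '0'..'9'. */
static char digit(uint32_t d) { return d < 26 ? 'a' + d : '0' + (d - 26); }

/* Bias adaptation from RFC 3492 section 6.1. */
static uint32_t adapt(uint32_t delta, uint32_t numpoints, int first)
{
	uint32_t k = 0;

	delta = first ? delta / DAMP : delta / 2;
	delta += delta / numpoints;
	for (; delta > ((BASE - TMIN) * TMAX) / 2; k += BASE)
		delta /= BASE - TMIN;
	return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Number of bytes the code point c occupies in UTF-8. */
size_t utf8_len(uint32_t c)
{
	return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

/* Encode n code points of one label into out (capacity cap, result is
 * NUL-terminated). Returns the encoded length, or -1 if it does not fit. */
long punycode_encode(const uint32_t *in, size_t n, char *out, size_t cap)
{
	uint32_t cp = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
	size_t h, b, o = 0, i;

	if (!cap) return -1;
	/* basic (ASCII) code points are copied through first */
	for (i = 0; i < n; i++)
		if (in[i] < 0x80) {
			if (o + 1 >= cap) return -1;
			out[o++] = in[i];
		}
	h = b = o;
	if (b) {
		if (o + 1 >= cap) return -1;
		out[o++] = '-';
	}
	while (h < n) {
		uint32_t m = UINT32_MAX;

		/* smallest code point not yet handled */
		for (i = 0; i < n; i++)
			if (in[i] >= cp && in[i] < m) m = in[i];
		delta += (m - cp) * (uint32_t)(h + 1);	/* overflow unchecked */
		cp = m;
		for (i = 0; i < n; i++) {
			if (in[i] < cp) delta++;
			if (in[i] != cp) continue;
			/* emit delta as a variable-length integer */
			uint32_t q = delta, k;
			for (k = BASE; ; k += BASE) {
				uint32_t t = k <= bias ? TMIN :
					k >= bias + TMAX ? TMAX : k - bias;
				if (q < t) break;
				if (o + 1 >= cap) return -1;
				out[o++] = digit(t + (q - t) % (BASE - t));
				q = (q - t) / (BASE - t);
			}
			if (o + 1 >= cap) return -1;
			out[o++] = digit(q);
			bias = adapt(delta, h + 1, h == b);
			delta = 0;
			h++;
		}
		delta++;
		cp++;
	}
	out[o] = 0;
	return o;
}
```

Feeding it 56 copies of U+1F600 yields "e28h" followed by 55 'a' characters,
59 bytes in total, which with the "xn--" prefix exactly fills a 63-byte label.
The same label in UTF-8 takes 56 * 4 = 224 bytes, so a decode buffer sized for
the punycode form would be far too small.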

This, by the way, also means that we should probably mess with the
buffering when reading the hosts file.

[1] https://tools.ietf.org/html/rfc5892