Date: Sun, 23 Apr 2017 12:56:14 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH v2] IDNA support in name lookups

On Sun, Apr 23, 2017 at 06:38:24PM +0200, Joakim Sindholt wrote:
> > If I'm not mistaken, your patch as-is actually _breaks_ support for
> > literal utf-8 hostnames in the hosts file.
> 
> No, it supports all combinations. It encodes every hostname found to the
> punycode equivalent and compares only punycode. The encoding is done on
> a per-label basis, so ønskeskyen.dk is split into "xn--nskeskyen-k8a"
> and "dk", each label treated entirely separately.
> Suppose they had a subdomain, ønsker.ønskeskyen.dk; then this approach
> would work if someone put "ønsker.xn--nskeskyen-k8a.dk" in their
> hosts file, even if you looked it up as xn--nsker-uua.ønskeskyen.dk.

OK, that makes more sense and it's non-harmful.
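
To make that concrete, here is a minimal sketch of the encode-then-compare
idea: both the queried name and a name found in /etc/hosts get punycode
encoded and only the encoded forms are compared. Only the idnaenc()
prototype is taken from the patch quoted further down; name_matches() is
purely illustrative, not code from the patch.

/* Sketch: normalize both sides with the patch's idnaenc(), then compare
 * only the encoded (ASCII) forms. */
#include <strings.h>
#include <sys/types.h>

ssize_t idnaenc(char *dst, const char *src); /* from the proposed patch */

static int name_matches(const char *query, const char *hosts_entry)
{
	char a[256], b[256];
	if (idnaenc(a, query) <= 0 || idnaenc(b, hosts_entry) <= 0)
		return 0;
	/* DNS names compare case-insensitively in their ASCII form. */
	return !strcasecmp(a, b);
}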

> There are a couple of reasons that I chose to do it this way, with the
> primary being that it meant I only had to include the encoding function
> and the secondary being that the encoded name will always fit in 254
> characters, whereas the decoded equivalent might be larger (see below).
> 
> I don't have any ideological opposition to doing the comparison entirely
> in unicode though. This was just the more practical solution for the
> time being.

It's probably okay the way you're doing it, then. I had failed to
understand that it accepts either form. The main other consideration
might be whether it's noticeably slow on large hosts files, but I don't
think the existing code was terribly fast anyway.

> > > > >  int __lookup_name(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, int flags)
> > > > >  {
> > > > > +	char _name[256];
> > > > >  	int cnt = 0, i, j;
> > > > >  
> > > > >  	*canon = 0;
> > > > >  	if (name) {
> > > > > -		/* reject empty name and check len so it fits into temp bufs */
> > > > > -		size_t l = strnlen(name, 255);
> > > > > -		if (l-1 >= 254)
> > > > > +		/* convert unicode name to RFC3492 punycode */
> > > > > +		ssize_t l;
> > > > > +		if ((l = idnaenc(_name, name)) <= 0)
> > > > >  			return EAI_NONAME;
> > > > > -		memcpy(canon, name, l+1);
> > > > > +		memcpy(canon, _name, l+1);
> > > > > +		name = _name;
> > > > >  	}
> > > > 
> > > > If it's not needed for hosts backend, this code probably belongs
> > > > localized to the dns lookup, rather than at the top of __lookup_name.
> > > > 
> > > > BTW there's perhaps also a need for the opposite-direction
> > > > translation, both for ai_canonname (when a CNAME points to IDN) and
> > > > for getnameinfo reverse lookups. But that can be added as a second
> > > > patch I think.
> > > 
> > > I have already written the code for decoding as well if need be :)
> > 
> > Yay!
> > 
> > > The only problem as I see it is that a unicode name can be a hair under
> > > 4 times larger (in bytes) than the punycode equivalent. Select any 4
> > > byte UTF-8 character and make labels consisting exclusively of it. Every
> > > occurrence after the first will be encoded as a single 'a'.
> > > 
> > > This, by the way, also means that we should probably mess with the
> > > buffering when reading the hosts file.
> > 
> > You mean increase it? Since HOST_NAME_MAX/NI_MAXHOST is 255 and this
> > is ABI, I don't think we can return names longer than 255 bytes. Such
> > names probably just need to be left as punycode when coming from dns,
> > and not supported in hosts files.
> 
> I'm not sure that's the correct interpretation. For example when doing
> HTTP requests the browser will fill in the HTTP Host header with the
> punycode name. I think the intent is to do everything with the punycoded
> name and ditch utf8 permanently :(

That's ugly and backwards, but it fits with the header encoding being
ASCII. However, it's not universal. In the other most important place
domain names are used, certificates, they're proper unicode; punycode
is not used there.
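
For illustration, this is roughly what a browser sends for the
ønskeskyen.dk example from earlier in the thread; the punycoded name is
what ends up in the Host header:

GET / HTTP/1.1
Host: xn--nskeskyen-k8a.dk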

> What I meant here, though, was that when reading from /etc/hosts the
> line buffer used is 512 bytes, which isn't necessarily enough for one
> full domain in utf8.

But it is enough assuming we honor HOST_NAME_MAX.

> Take any cuneiform character of your choice, repeat it
> 56 times and you get a label of 63 chars that looks something like
> xn--4c4daaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> but has a utf8 size of 224 bytes. You can have almost 4 of those in an
> otherwise completely valid FQDN. The last one would have to lose 2 'a's
> (8 bytes) making the total 63+1+63+1+63+1+61+1=254, decoding to a total
> of 224+1+224+1+224+1+216+1=892 bytes.
> 
> Now I will freely admit that I don't know for sure if this is legal but
> I believe that it is, and that the maximum length of 254 applies solely
> to the encoded name.
> I can at least say that dig will happily accept that massive domain
> name and complain about a domain with one more cuneiform in the last
> label.
> It will also complain if any one label exceeds 63 bytes post-encoding.
> 
> For the record, glibc has #define NI_MAXHOST 1025, if it matters at all.

Yeah, so far I've considered that a mistake. For non-DNS names it's
rather up to the implementation what they support, but I don't think
there's significant merit in supporting gratuitously/maliciously long
hostnames.
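
As a sanity check on the expansion arithmetic in the quoted example,
here's a small standalone sketch; it just rebuilds one 56-character
cuneiform label (using U+12000 as the 4-byte UTF-8 character) and
restates the totals from the example, nothing more.

/* Worst-case IDNA expansion from the example above: a 63-byte punycode
 * label can decode to a 224-byte UTF-8 label (56 x 4-byte characters). */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *cune = "\xF0\x92\x80\x80"; /* U+12000, 4 bytes in UTF-8 */
	char label[56*4 + 1] = "";
	for (int i = 0; i < 56; i++) strcat(label, cune);
	printf("one decoded label: %zu bytes\n", strlen(label));           /* 224 */
	printf("encoded FQDN:      %d bytes\n", 63+1+63+1+63+1+61+1);      /* 254 */
	printf("decoded FQDN:      %d bytes\n", 224+1+224+1+224+1+216+1);  /* 892 */
	return 0;
}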

Rich
