Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 30 Mar 2018 15:35:48 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] resolver: only exit the search path loop there
 are a positive number of results given

On Fri, Mar 30, 2018 at 02:19:48PM -0500, William Pitcock wrote:
> Hello,
> 
> On Fri, Mar 30, 2018 at 2:14 PM, Rich Felker <dalias@...c.org> wrote:
> > On Fri, Mar 30, 2018 at 06:52:25PM +0000, William Pitcock wrote:
> >> In the event of no results being given by any of the lookup modules, EAI_NONAME will still
> >> be thrown.
> >>
> >> This is intended to mitigate problems that occur when zones are hosted by weird DNS servers,
> >> such as the one Cloudflare have implemented, and appear in the search path.
> >> ---
> >>  src/network/lookup_name.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
> >> index 209c20f0..b068bb92 100644
> >> --- a/src/network/lookup_name.c
> >> +++ b/src/network/lookup_name.c
> >> @@ -202,7 +202,7 @@ static int name_from_dns_search(struct address buf[static MAXADDRS], char canon[
> >>                       memcpy(canon+l+1, p, z-p);
> >>                       canon[z-p+1+l] = 0;
> >>                       int cnt = name_from_dns(buf, canon, canon, family, &conf);
> >> -                     if (cnt) return cnt;
> >> +                     if (cnt > 0) return cnt;
> >>               }
> >>       }
> >
> > This patch is incorrect, and the reason should be an FAQ item if it's
> > not already. Only a return value of 0 means that the requested name
> > does not exist and that it's permissible to continue search. Other
> > nonpositive return values indicate either that the name does exist but
> > does not have a record of the quested type, or that a transient error
> > occurred, making it impossible to determine whether the search can be
> > continued and thus requiring the error to be reported to the caller.
> > Anything else results in one or both of the following bugs:
> >
> > - Nondeterministically returning different results for the same query
> >   depending on transient unavailability of the nameservers to answer
> >   on time.
> >
> > - Returning inconsistent results (for different search components)
> >   depending on whether AF_INET, AF_INET6, or AF_UNSPEC was requested.
> >
> > I'm aware that at least rancher-dns and Cloudflare's nameservers have
> > had bugs related to this issue. I'm not sure what the status on
> > getting them fixed is, and for Cloudflare I don't know exactly what it
> > is they're doing wrong or why. But I do know the problem is that
> > they're returning semantically incorrect dns replies.
> 
> Kubernetes imposes a default search path with the cluster domain last, so:
> 
>   - local.prod.svc.whatever
>   - prod.svc.whatever
>   - svc.whatever
>   - yourdomain.com
> 
> The cloudflare issue is that they send SUCCESS code with 0 replies,
> which causes musl to error when it hits the yourdomain.com.

Yes, that makes sense. Do you know why they're doing it? If they
refuse to fix it, the only clean fix I know is a local proxy
configured to fix the records for the specific broken domains you care
about. But of course that's not convenient.

> Do you have any suggestions on a mitigation which would be more
> palatable?  We need to ship a mitigation for this in Alpine 3.8
> regardless.  I would much rather carry a patch that is upstreamable,
> but I am quite willing to carry one that isn't, in order to solve this
> problem.

A theoretically-non-horrible (but somewhat costly) solution is to
always query both A and AAAA, rather than only doing it for AF_UNSPEC.
Then if you see a reply with 0 (total, between both) records, you can
opt to interpret that the same way as NxDomain without breaking
consistency properties. If Cloudflare refuses to fix the bug, maybe we
should consider adding an _option_ (in the resolv.conf options line)
to do this. I don't think it should be the default behavior because it
mildly slows down lookups, especially if you have nontrivial packet
loss since probability of failure is now 1-(1-p)²=2p-p² rather than p
(where p is the packet loss rate).

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.