musl - Re: Re: Would love to see reconsideration for domain and search

Message-ID: <CAO_Rewbizifjh=QF6cBfx=auXaogZ18aWzQbEYH83Ymjk-ELmQ@mail.gmail.com>
Date: Thu, 22 Oct 2015 16:37:18 -0700
From: Tim Hockin <thockin@...gle.com>
To: musl@...ts.openwall.com
Cc: Rich Felker <dalias@...c.org>
Subject: Re: Re: Would love to see reconsideration for domain and search

On Thu, Oct 22, 2015 at 4:00 PM, Josiah Worcester <josiahw@...il.com> wrote:
> On Thu, Oct 22, 2015 at 3:37 PM Tim Hockin <thockin@...gle.com> wrote:
>>
>> On Thu, Oct 22, 2015 at 2:56 PM, Rich Felker <dalias@...c.org> wrote:
>> > On Thu, Oct 22, 2015 at 02:24:11PM -0700, Tim Hockin wrote:
>> >> Hi all,
>> >>
>> >> I saw this thread on the web archive but am not sure how to respond to
>> >> the thread directly as a new joinee of the ML.  I hope this finds its
>> >> way...
>> >
>> > No problem; just starting a new thread like this and quoting the old
>> > one is fine.
>> >
>> >> I am one of the developers of Kubernetes and I own the DNS portion, in
>> >> particular.  I desperately want to use Alpine Linux (based on musl)
>> >> but for now I have to warn people NOT to use it because of this issue.
>> >>
>> >> On Fri, Sep 04, 2015 at 02:04:29PM -0400, Rich Felker wrote:
>> >> > On Fri, Sep 04, 2015 at 12:11:36PM -0500, Andy Shinn wrote:
>> >> >> I'm writing the wonderful musl project today to open discussion
>> >> >> about the future possibility of DNS search and domain keyword
>> >> >> support. We've been using musl libc (by way of Alpine Linux) for
>> >> >> new development of applications as containers that discover each
>> >> >> other through DNS and other software defined networking.
>> >> >>
>> >> >> In particular, we are starting to use applications like SkyDNS,
>> >> >> Consul, and Kubernetes, all of which rely on local name
>> >> >> resolution in some way using search paths. Many users of the
>> >> >> Alpine Linux container image have also expressed their desire for
>> >> >> this feature at
>> >> >> https://github.com/gliderlabs/docker-alpine/issues/8.
>> >> >>
>> >> >> The "functional differences from glibc" page says of the domain
>> >> >> and search keywords: "Support may be added in the future if there
>> >> >> is demand". So please consider this request an addition to
>> >> >> whatever demand for the feature already exists.
>> >> >>
>> >> >> Thank you for your time and great work on the musl libc project!
>> >> >
>> >> > I think this is a reasonable request. I'll look into it more.
>> >> >
>> >> > One property I do not want to break is deterministic results, so
>> >> > when a search is performed, if any step of the search ends with
>> >> > an error rather than a positive or negative result, the whole
>> >> > lookup needs to stop and report the error rather than falling
>> >> > back. Falling back is not safe and creates a situation where DoS
>> >> > can be used to control which results are returned.
>> >>
>> >> I understand your point, though the world at large tends to disagree.
>> >> Everyone has a primary and secondary `nameserver` record (or should).
>> >> If the first one times out, try the second.  Most resolver libs seem
>> >> to accept a SERVFAIL response or a timeout as a signal to try the next
>> >> server, and I would encourage you to do the same.
>> >
>> > musl intentionally does not do this because it yields abysmal
>> > performance. If the first nameserver is overloaded or the packet is
>> > lost, you suffer several-second lookup latency.
>>
>> But at least it works eventually.  You're faced with a choice: wait 2
>> seconds for ns1 to time out and then fail in a way that most apps don't
>> handle well, or wait 2 seconds and then (usually) get a fast
>> response from ns2.
>>
>> It seems better in every way to eventually succeed, though I agree
>> it's a bit less visible.
>
>
> With musl's current design, you get a request to ns1 and ns2, and the first
> authoritative response wins. So, if ns1 fails then all is well and
> performance isn't even notably impacted. What you are describing appears to
> be how you would *have* to implement it if you decide against considering
> all servers equal, but instead try and serve the union of their responses
> (that is, wait for timeout and then fail).

The authoritative-ness is a dimension I had not considered.  I could
believe that the first authoritative answer wins, but what if you only
get a non-authoritative answer from ns1 and ns2 never responds?

> Consider what would happen if ns1 and ns2 have different responses, but ns1
> for whatever reason times out (potentially an attacker). Then you get the
> results for ns2, even though ns1 is intended to override it.

I agree in theory.  And yet this is how most resolvers work today.
Are they all broken?

>> >> Stopping on positive response or NXDOMAIN seems to be commonly
>> >> accepted with a caveat.  You can't query all nameservers and just take
>> >> the first NXDOMAIN to respond.  You can only accept NXDOMAIN if all of
>> >> the higher-priority (listed first in resolv.conf) nameservers have
>> >> timed out or SERVFAIL'ed.  You can issue queries in parallel, but you
>> >> must process responses in order, which is what you describe below.
>> >
>> > Timeout or servfail is not sufficient to accept an nxdomain from a
>> > lower-priority server. To preserve consistency of results under
>> > transient failure, you actually have to wait for the nxdomain from the
>>
>> I have to disagree.  Some non-forwarding DNS servers use SERVFAIL to
>> indicate "I am not serving for that domain" specifically to make the
>> client move to their next nameserver.  If ns1 returns SERVFAIL, try
>> ns2.  If ns1 times out, try ns2.  Otherwise what good is ns2?
>
>
> Note that this means that any condition where ns1 can't be accessed changes
> what DNS resolves to. If you wish to prevent unexpected behavior when ns1
> can't be accessed, you simply have to get a response from ns1 first, and
> only ever query ns2 when the response from ns1 indicates you may.

If you get a response from ns1, why would you ever go to ns2?

>> > higher-priority server. Either way, this very much pessimizes usage
>> > cases like running "netstat" with huge numbers of connections where
>> > many of the ip addresses fail to reverse. Being able to return
>> > immediately as soon as any one of the nameservers responds with
>> > nxdomain makes the difference between a <1s netstat run and a 5-10s
>> > netstat run.
>>
>> Sure it's faster but it's WRONG.  Returning a random number would be
>> faster, too, but it is equally wrong.  This is why netstat (and myriad
>> other tools) has a `-n` flag.
>
>
> Again, musl's design assumes that all nameservers are hosting the same
> space, and thus a single nxdomain is authoritative. If this is the case,
> then this is perfectly correct. If it's not the case, then it's wrong (and

Agree

> you need to wait for literally everything to nxdomain, and if there's a
> timeout from a single server you *need* to report that).

I don't buy the "everything" part.  If we assume that nameserver
ordering is priority ordering, the first NXDOMAIN is sufficient.   But
I can see how this is maybe not the normal case, and we're probably
moving to a model more consistent with musl's view (but we still need
search :)

>> > Thus, if we add extensions to support the kind of result unioning you
>> > want across multiple nameservers, I think they should be configurable
>> > and off-by-default. A simple option in resolv.conf could turn them on.
>> > And there could be options for requiring nxdomain from all servers
>> > (true union) or just for highest-priority when accepting negative
>> > results.
>>
>> I can't agree with this.  It's reasonable to make options for these,
>> but I think the non-standard behaviors should be off by default.
>> Consider this from the point of view of a system like Docker or
>> Kubernetes, which generate resolv.conf for you - they have no idea
>> what libc your apps are using, so it's unreasonable to ask them to
>> turn off libc-specific flags.  However, the end user knows, and it is
>> perfectly sane to ask them to explicitly opt-in to non-standard
>> optimized behaviors.
>>
>> >> > While it would be possible to parallelize the search while
>> >> > serializing the results (i.e. waiting to accept a result from the
>> >> > second query until the first query finished with a negative
>> >> > result), I think the consensus during the last round of
>> >> > discussion of this topic was that the complexity cost is too
>> >> > great and the benefit too small. Ideally, the first query should
>> >> > always succeed, anyway.
>> >>
>> >> The real world is not ideal.  Not all nameservers are identically
>> >> scoped - you MUST respect the ordering in resolv.conf - to do
>> >> otherwise is semantically broken.  If implementation simplicity means
>> >> literally doing queries in serial, then that is what you should do.
>> >
>> > Even legacy resolvers had the option to rotate the nameservers for
>> > load-balancing, so I think it's a stretch to say the ordering is
>> > supposed to be semantic. My view has always been that multiple
>> > nameservers in resolv.conf are for redundancy, not for serving
>> > conflicting records.
>>
>> You argued above that you should not try a secondary server in case of
>> timeout or SERVFAIL.  Obviously you would not try it on success nor
>> NXDOMAIN.  When do you see a secondary being used at all?
>
>
> If I understand correctly, what you are expecting is this:
> Try resolving from ns1.
> If that domain does not exist on ns1, try ns2.
> If that domain does not exist on ns2, report failure.
>
> The only way to do this in a consistent fashion is actually to fall through
> to the secondary if and only if you get an nxdomain from ns1.

not nxdomain - servfail.  If you get nxdomain the server is saying "I
know that this domain does not exist".  This seems to work on glibc
and busybox and every other resolver we have tried except musl.  But
again, we're not even using this functionality and probably moving to
a more "standard" model.
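
To make the rule concrete, here is a rough sketch of the ordered
fallback I am describing.  It is illustrative Python, not code from
glibc, busybox, or musl; the names Rcode and resolve_in_order are made
up for this example, and TIMEOUT is not a real DNS rcode, it only
stands in for "no reply".

```python
from enum import Enum


class Rcode(Enum):
    NOERROR = "NOERROR"
    NXDOMAIN = "NXDOMAIN"
    SERVFAIL = "SERVFAIL"
    TIMEOUT = "TIMEOUT"  # hypothetical marker for "no reply", not a DNS rcode


def resolve_in_order(replies):
    """Pick a result from replies listed in resolv.conf nameserver order.

    Each reply is a (Rcode, answer) pair.  A positive answer or NXDOMAIN is
    treated as definitive and ends the walk; SERVFAIL or a timeout means
    "move on to the next nameserver".
    """
    for rcode, answer in replies:
        if rcode in (Rcode.NOERROR, Rcode.NXDOMAIN):
            return rcode, answer
    # Every nameserver failed or timed out, so report a failure.
    return Rcode.SERVFAIL, None
```

With this rule, a SERVFAIL from ns1 followed by an answer from ns2
yields the answer, while an NXDOMAIN from ns1 is final even if ns2
would have answered.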

>> As for rotate, note that it is an option and OFF by default in every
>> mainstream resolver implementation.
>>
>> But this point is sort of academic for us - we're moving to a
>> forwarding nameserver so really there is only the primary nameserver.
>>  We just need you to ask the first nameserver first.
>>
>> >> Similarly, you can't just search all search domains in parallel and
>> >> take the first response.  The ordering is meaningful.
>> >
>> > Indeed, search domains are like that, because they inherently produce
>> > ambiguity/overlapping namespaces with different definitions. This is
>> > why myself and others who weighed in on the original question of
>> > supporting them were against, but left the option open to revisit the
>> > topic if users who need them show up.
>>
>> Yeah, I scanned the related threads.  I understand the issue in
>> theory, but in practice these are things configured by admins.  If
>> there's a conflict or ambiguity, you should solve that, not jettison
>> powerful functionality.
>>
>> >> > I also have a few questions:
>> >> >
>> >> > 1. Do you need multiple search items, or just a single domain?
>> >> >    Any setup with multiple searches necessarily has suboptimal
>> >> >    performance because ndots is not sufficient to make the right
>> >> >    initial choice of query. If you do need this functionality, a
>> >> >    unioning proxy dns server may be a better option than resolv.conf
>> >> >    domain search; it would give much better performance.
>> >>
>> >> We use multiple search paths and ndots > 1.  I'm not sure what you
>> >> mean by "unioning" here.  Search path ordering is as meaningful as
>> >> nameserver ordering.  You can't avoid making the query for each search
>> >> suffix in the worst case, and it has the same restriction as
>> >> nameserver - the search order must be respected.
>> >>
>> >> There do seem to be some different implementations that search for
>> >> the "naked" query first vs last, though.  I think the semantically
>> >> correct (but pessimal performance) is to search for that last.
>> >
>> > The traditional behavior is to do the naked query first if the query
>> > string has at least 'ndots' dots, and to do the search domains first
>> > otherwise. Also I believe a final dot always suppresses search.
>> >
>> > My point was that with ndots=1 (default) and only a single search
>> > domain, the _expected_ result is that the first query succeed. But if
>> > you have ndots>1 or multiple search domains, you expect a portion of
>> > your queries to fall back at least once. This adds significant
>> > latency.
>>
>> It adds latency, but the magnitude is very much determined by the
>> installation.  In our case it is negligible and well worth the cost.
>> I fear you're optimizing without data - it should be the site-admin's
>> problem to configure things in an acceptable way.  libc doesn't get to
>> decide what "acceptable" means.
>>
>> > In such a situation, you can avoid the additional latency (except on
>> > the first query of a given record) by running a local caching
>> > nameserver that does the search and unioning for you, rather than
>> > having the stub resolver in libc do it. Then subsequent queries
>> > succeed immediately using the cache. The reason I asked about usage
>> > case (ndots=1 vs ndots>1, single vs multiple search) is that, in the
>> > multi-fallback case, it might make more sense (from a performance and
>> > clean design standpoint) to implement this with a caching nameserver
>> > on localhost rather than in musl.
>>
>> We might be moving to a per-machine local DNS agent, which would cache
>> as you describe.  HOWEVER, there's a pretty important piece that I
>> guess I left out.  Docker and Kubernetes and similar systems run many
>> containers per machine.  Each container has a potentially different
>> search path.  I might run 100 or more containers on a single machine -
>> I can't run 100 DNS caches, and I can't put that back on users.
>
>
> Why not? 100 DNS caches shouldn't be particularly expensive.

I only have one port 53 to use on the host, and I can't force this to
run inside user containers.  I guess I could write a split-horizon
proxy that expands search paths and caches per-client.  That's
exceedingly silly given that ONLY musl doesn't work :)  It's also
somewhat wrong since I will return an address for a name the client
did not actually ask for.
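
For reference, the expansion such a proxy (or a stub resolver) would
have to perform looks roughly like the sketch below.  It follows the
traditional behavior described further down in this thread: try the
name as given first if it has at least ndots dots, try the search
domains first otherwise, and skip the search entirely for a name with
a trailing dot.  The function name search_candidates is made up for
this example, and this is not code from any real resolver.

```python
def search_candidates(name, search, ndots=1):
    """Return the fully qualified names to try, in order.

    name   -- the name the application asked for
    search -- search domains in resolv.conf order
    ndots  -- dot-count threshold for trying the name as given first
    """
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search at all.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]
```

For example, with ndots=5 and the (illustrative) search list
["default.svc.cluster.local", "svc.cluster.local"], a lookup of "redis"
tries "redis.default.svc.cluster.local." first, then
"redis.svc.cluster.local.", and only then "redis." on its own.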

>> So from our perspective the search paths MUST come from the containers
>> themselves, even if we run a machine local cache to mitigate latency
>> and SPOF.
>>
>> >> > 2. For your intended applications, is there a need to support
>> >> >    ndots>1?  Such configurations are generally not friendly to
>> >> >    applications that expect to be able to resolve normal internet
>> >> >    domain lookups, and performance for such lookups will be very bad
>> >> >    (because the search domains first have to fail).
>> >>
>> >> DNS is a very lightweight protocol.  We have not measured any
>> >> practical detriment for having 6 search domains and ndots=5.  In the
>> >> normal case it fails very quickly.  That aside, it should be my
>> >> business if I want to (mis)configure my system that way :)
>> >
>> > I suspect we have different definitions of quick... :)
>>
>> Quick is situational.  In a cloud-based mostly-webapp stack, 50ms to
>> do a name lookup ain't so bad, given the relative infrequency of that
>> operation.  Also most names actually DO resolve on the first or second
>> search path.
>>
>> >> > 3. The glibc behavior is just to swap the order of search when
>> >> >    the query string has >=ndots dots in it, but would it be
>> >> >    acceptable never to try the search domains at all in this case?
>> >> >    That would yield much better performance for nxdomain results and
>> >> >    avoid unexpected positive results due to weird subdomains
>> >> >    existing in your search domain (e.g. a wildcard for
>> >> >    *.us.example.com would cause *.us to wrongly resolve for
>> >> >    nonexistent .us domains).
>> >>
>> >> I think that would be correct.  If I have 3 dots and ndots=2, search
>> >> paths should be ignored.
>> >
>> > Glad we agree on this.
>> >
>> > I hope you feel like this conversation is productive. I don't want to
>> > rule out anything/"say no" right away, but rather try to get a better
>> > understanding of your requirements first and figure out what makes the
>> > most sense to do on musl's side.
>>
>> Absolutely.  I'm happy to engage.  Obviously our use case is a bit
>> outside of what musl was really aiming for, but it offers a really
>> nice base for very efficient containers.  A lot of people want to use
>> Alpine and it breaks my heart to tell them they can't.