Date: Wed, 1 Dec 2021 15:57:48 +0000 (GMT)
From: Mark Hills <mark@...x.org>
To: Rich Felker <dalias@...c.org>
cc: musl@...ts.openwall.com
Subject: Re: DNS resolver fails prematurely when server reports failure?

On Wed, 1 Dec 2021, Rich Felker wrote:

> On Wed, Dec 01, 2021 at 12:49:07PM +0000, Mark Hills wrote:
> > With multiple DNS servers in /etc/resolv.conf, the docs are clear:
> >
> >   "musl's resolver queries them all in parallel and accepts whichever
> >   response arrives first."
> >
> > So dual configuration is expected to give greater resiliency:
> >
> >   nameserver 188.8.131.52 # OVH
> >   nameserver 184.108.40.206 # Cloudflare
> >
> > However, 220.127.116.11 appears quite prone to some kind of internal SERVFAIL
> > (may be internal load shedding; though we are not making excessive DNS
> > queries)
> >
> > With glibc's cascading behaviour (or perhaps another OS) this may be dealt
> > with by the client.
> >
> > But if the wiki is read literally, the first response received is "this
> > server has failed" then a good response from another server is ignored?
>
> No. ServFail is an inconclusive response, treated basically the same
> as if no packet had arrived at all. (Slight difference: it triggers
> immediate retry up to a limited number of times.)

Ok, thanks. That sounds correct, and I realise now that the real process
of the query is in this source file, which is why the code looked so opaque.

Could it be better to make a small change to the wiki text? Perhaps
"conclusive answer" instead of "response":

  accepts whichever conclusive answer arrives first

> > And indeed this seems to be the behaviour we experience, as removing
> > 18.104.22.168 restored reliability.
>
> Have you looked at a packet capture of what's happening? Likely
> 22.214.171.124 was returning a false conclusive result (NxDomain or NODATA)
> rather than ServFail.

We caught the problem with a tcpdump (which is how we first realised the
differing behaviour between the man page and musl), and reproduced it with
"dig"; however, it doesn't seem to be reproducible now.

My recollection is that it was an instant response, and that is where I
first encountered "ServFail", but I'll see if we have logged the actual run.
I'm _fairly_ sure I'd have noticed a false but conclusive response.

I'm re-adding the "backup" DNS on a test system to see if we can get back
to reproducing the problem.

https://git.musl-libc.org/cgit/musl/tree/src/network/res_msend.c#n30

Thanks

-- 
Mark
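
The conclusive/inconclusive distinction discussed in this thread can be sketched in a few lines of C. This is only an illustration and not musl's actual code: the helper name is_conclusive is hypothetical, and the choice of which response codes count as conclusive (NOERROR and NXDOMAIN) is an assumption drawn from the reply quoted above. The limited immediate retry that ServFail triggers is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* DNS response codes (RCODE) as defined in RFC 1035, section 4.1.1. */
enum {
    RCODE_NOERROR  = 0,
    RCODE_SERVFAIL = 2,
    RCODE_NXDOMAIN = 3
};

/*
 * Hypothetical helper: return 1 if a raw DNS reply is a conclusive
 * answer, 0 if it should be treated as if no packet had arrived.
 * The RCODE is the low four bits of the fourth header byte.
 */
static int is_conclusive(const unsigned char *pkt, size_t len)
{
    assert(pkt != NULL);

    /* Anything shorter than the 12-byte DNS header is unusable. */
    if (len < 12)
        return 0;

    switch (pkt[3] & 15) {
    case RCODE_NOERROR:   /* a real answer (possibly empty, i.e. NODATA) */
    case RCODE_NXDOMAIN:  /* the name definitely does not exist */
        return 1;
    default:              /* ServFail and everything else: keep waiting */
        return 0;
    }
}
```

Under this reading, a resolver querying several servers in parallel would stop at the first reply for which is_conclusive() returns 1, which matches the "accepts whichever conclusive answer arrives first" wording suggested above.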