musl - TCP fallback open questions

Message-ID: <20220916041435.GL9709@brightrain.aerifal.cx>
Date: Fri, 16 Sep 2022 00:14:36 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: TCP fallback open questions

I'm beginning to work on the long-awaited TCP fallback path in musl's
stub resolver, and have here a list of things I've found so far that
need some decisions about specifics of the behavior. For the most part
these "questions" already have a seemingly solid choice I'm leaning
towards, but since this is a topic that's complex and that's had lots
of dissatisfaction over it in the past, I want to put them out to the
community for feedback as soon as possible so that any major concerns
can be considered.



1. How to switch over in the middle of a (multi-)query:

In general, the DNS query core is performing M (1 or 2) concurrent
queries against N configured nameservers, all in parallel (N*M
in-flight operations). Any one of those might get back a truncated
response requiring fallback to TCP. We want to do this fallback in a
way that minimizes additional latency, resource usage, and logic
complexity, which may be conflicting goals.

As policy, we already assume the configured nameservers are redundant,
providing non-conflicting views of the same DNS namespace. So once we
see a single reply for a given question that requires TCP fallback,
it's reasonable to conclude that any other reply to that question
(from the other nameservers) would also require fallback, and to stop
further retries of that question over UDP and ignore further answers
to that question over UDP. The other question (in practice at most
one, the opposite A/AAAA query) may, however, still have satisfactory
UDP answers, so ideally we want to keep listening for those, and keep
retrying them.

In principle we could end up using N*M TCP sockets for an exhaustive
parallel query. N and M are small enough that this isn't huge, but
it's also not really nice. Given that the switch to TCP was triggered
by a truncated UDP response, we already know that the responding
server *knows the answer* and just can't send it within the size
limits. So a reasonable course of action is just to open a TCP
connection to the nameserver that issued the truncated response. This
is not necessarily going to be optimal -- it's possible that another
nameserver has gotten a response in the meantime and that the
round-trip for TCP handshake and payload would be much lower to that
other server. But I'm doubtful that consuming extra kernel resources
and producing extra network load to optimize the latency here is a
reasonable tradeoff.

I'm assuming so far that each question at least would have its own TCP
connection (if truncated as UDP). Using multiple nameservers in
parallel with TCP would maybe be an option if we were doing multiple
queries on the same connection, but I'm not aware of whether TCP DNS
has any sort of "pipelining" that would make this perform reasonably.
Maybe if "priming" the question via UDP it doesn't matter though and
we could expect the queries to be processed immediately with cached
results? I don't think I like this but I'm just raising it for
completeness.

TL;DR summary: my leaning is to do one TCP connection per question
that needs fallback, to the nameserver that issued the truncated
response for the question. Does this seem reasonable? Am I overlooking
anything important?
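
A minimal sketch of the per-question switch-over described above, not
actual musl code. The names (struct question, on_udp_reply, the Q_*
states) are hypothetical; the only protocol fact relied on is that the
TC flag is bit 0x02 of the third header byte. The first truncated
reply pins the question to the nameserver that sent it, and later UDP
replies for that question are ignored, while the other question keeps
running over UDP.

```c
/* Hypothetical per-question state; not taken from musl. */
enum qstate { Q_UDP, Q_TCP, Q_DONE };

struct question {
	enum qstate state;
	int tcp_ns;   /* index of the nameserver chosen for TCP, or -1 */
};

/* Handle one UDP reply to question q from nameserver index ns.
 * Returns 1 if the reply should be used as the answer, 0 if it is
 * ignored (too short, already answered, or fallback was triggered). */
static int on_udp_reply(struct question *q, int ns,
	const unsigned char *msg, int len)
{
	if (len < 12 || q->state != Q_UDP)
		return 0;            /* malformed, or UDP no longer wanted */
	if (msg[2] & 0x02) {         /* TC bit set: reply was truncated */
		q->state = Q_TCP;    /* stop UDP retries for this question */
		q->tcp_ns = ns;      /* connect to the server that truncated */
		return 0;
	}
	q->state = Q_DONE;
	return 1;
}
```

With this shape at most M TCP sockets are ever open, one per
question, rather than N*M.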





2. Buffer shuffling:

Presently, UDP reads take place into the first unanswered buffer slot
and then get moved if it wasn't the right place. This does not seem
like it will work well when there are potentially partial TCP reads
taking place into one or more slots. I think the simplest solution is
just to use an additional fixed-size 512-byte local buffer in
__res_msend_rc for receiving UDP and always move it into the right
slot afterwards. The increased stack usage is not wonderful, but
rather small relative to the whole call stack, and probably
worth it to avoid code complexity. It also gives us a place to perform
throwaway TCP reads into, when reading responses longer than the
destination buffer just to report the needed length to the caller.
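
As a rough illustration of the "receive into a local buffer, then
move" idea, assuming replies are matched to queries by the 16-bit ID
in the first two bytes. The function name and parameter layout are
hypothetical and simplified; this is not the real __res_msend_rc
code.

```c
#include <string.h>

/* Copy a UDP reply from the temporary buffer tmp (rlen bytes) into
 * the slot of the first still-unanswered query with a matching ID.
 * alens[i] == 0 means slot i is unanswered. Returns the slot index,
 * or -1 if the reply matches nothing. */
static int place_reply(const unsigned char *tmp, int rlen,
	const unsigned char *const *queries, int nqueries,
	unsigned char *const *answers, int *alens, int asize)
{
	int i;
	if (rlen < 12) return -1;          /* shorter than a DNS header */
	for (i = 0; i < nqueries; i++) {
		if (alens[i]) continue;    /* already answered */
		if (tmp[0] == queries[i][0] && tmp[1] == queries[i][1])
			break;
	}
	if (i == nqueries) return -1;
	alens[i] = rlen;                   /* report the full length */
	memcpy(answers[i], tmp, rlen < asize ? rlen : asize);
	return i;
}
```

Since the copy only ever targets the matching slot, a slot that is in
the middle of a partial TCP read is never touched by UDP traffic.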




3. Timeouts:

UDP being datagram based, there is no condition where we have to worry
about blocking and getting stuck in the middle of a partial read.
Timeout occurs just at the loop level. 

Are there any special considerations for timeout here using TCP? My
leaning is no, since we'll still be in a poll loop regime, and
regardless of blocking state on the socket, recv should do partial
reads in the absence of MSG_WAITALL.
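
To make the partial-read point concrete, here is a sketch of one step
of a TCP receive state machine, meant to be called each time poll()
reports the socket readable. It assumes the standard 2-byte
big-endian length prefix that DNS over TCP puts in front of each
message. The struct and function names are hypothetical, and the
512-byte scratch buffer for discarding overflow is only one possible
choice.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

struct tcp_rx {
	unsigned char lenbuf[2];
	size_t pos;      /* bytes consumed so far, including the prefix */
	size_t msglen;   /* message length; valid once pos >= 2 */
};

/* Returns 1 when the whole message has been consumed, 0 if more data
 * is needed, -1 on error or EOF. Bytes beyond bufsize are read into
 * a scratch buffer and dropped, so msglen still reports the size. */
static int tcp_rx_step(int fd, struct tcp_rx *rx,
	unsigned char *buf, size_t bufsize)
{
	unsigned char scratch[512], *dst = scratch;
	size_t off, want, room = sizeof scratch;
	ssize_t r;

	if (rx->pos < 2) {
		/* Still collecting the length prefix. */
		r = recv(fd, rx->lenbuf + rx->pos, 2 - rx->pos, 0);
	} else {
		off = rx->pos - 2;
		want = rx->msglen - off;
		if (off < bufsize) { dst = buf + off; room = bufsize - off; }
		if (want > room) want = room;
		r = recv(fd, dst, want, 0);   /* no MSG_WAITALL: may be short */
	}
	if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
		return 0;                     /* spurious wakeup; poll again */
	if (r <= 0) return -1;
	rx->pos += r;
	if (rx->pos < 2) return 0;
	rx->msglen = rx->lenbuf[0] << 8 | rx->lenbuf[1];
	return rx->pos - 2 == rx->msglen;
}
```

Each call performs exactly one recv, so the timeout logic can stay in
the enclosing poll loop.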




4. Logic for when fallback is needed:

As noted in the thread "res_query/res_send contract findings",
fallback is always needed by these functions when they get a response
with the TC bit set because of the contract to return the size needed
for the complete answer. But for high level (getaddrinfo, etc.)
lookups, it's desirable to use truncated answers when we can. What
should the condition for "when we can" be? My first leaning was that
"nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
truncated response contains only the CNAME RR, not any records from
the A or AAAA RRset.

Some possible conditions that could be used:

- At least one RR of the type in the question. This seems to be the
  choice to make maximal use of truncated responses, but could give
  significantly fewer addresses than one might like if the nameserver
  is badly behaved or if there's a very large CNAME consuming most of
  the packet.

- No CNAME and packet size is at least 512 minus the size of one RR.
  This goes maximally in the other direction, never using results that
  might be limited by the presence of a CNAME, and ensuring we always
  have the number of answers we'd keep from a TCP response.

There are probably several other reasonable options on a spectrum
between these two.
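
For reference, the first condition ("at least one RR of the type in
the question") could be checked along these lines. This is only a
sketch with hypothetical function names. It assumes a single question
and the standard RR layout (name, then 10 fixed bytes for type,
class, TTL and RDLENGTH, then RDATA), and it treats an RR whose data
is cut off by the end of the packet as absent.

```c
/* Skip a possibly compressed name starting at offset p. Returns the
 * offset just past it, or -1 if it runs off the end of the message. */
static int skip_name(const unsigned char *msg, int len, int p)
{
	while (p < len) {
		if (!msg[p]) return p + 1;      /* root label ends the name */
		if ((msg[p] & 0xc0) == 0xc0)    /* compression pointer */
			return p + 2 <= len ? p + 2 : -1;
		if (msg[p] & 0xc0) return -1;   /* reserved label type */
		p += msg[p] + 1;
	}
	return -1;
}

/* Nonzero if the answer section holds at least one complete RR whose
 * type equals the question's QTYPE. */
static int has_answer_of_qtype(const unsigned char *msg, int len)
{
	int p, qtype, ancount, i;
	if (len < 12) return 0;
	if ((msg[4] << 8 | msg[5]) != 1) return 0;   /* QDCOUNT must be 1 */
	ancount = msg[6] << 8 | msg[7];
	p = skip_name(msg, len, 12);
	if (p < 0 || p + 4 > len) return 0;
	qtype = msg[p] << 8 | msg[p+1];
	p += 4;                                      /* QTYPE + QCLASS */
	for (i = 0; i < ancount; i++) {
		int rtype, rdlen;
		p = skip_name(msg, len, p);
		if (p < 0 || p + 10 > len) return 0;
		rtype = msg[p] << 8 | msg[p+1];
		rdlen = msg[p+8] << 8 | msg[p+9];
		p += 10;
		if (p + rdlen > len) return 0;       /* RDATA cut off */
		if (rtype == qtype) return 1;
		p += rdlen;
	}
	return 0;
}
```

A truncated reply carrying only the CNAME RR fails this check and
would go to TCP; one carrying a CNAME plus a single A record passes,
which is exactly the "fewer addresses than one might like" case noted
above.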

Unless name_from_dns (lookup_name.c) is changed to use longer response
buffers, the only case in which switching to TCP will give us a better
answer is when the nameserver is being petulant in its truncation. But
it probably should be changed, since the case where the entire packet
is consumed by a CNAME can be hit. To avoid that, the buffer needs to
be at least just under 600 bytes.