Date: Fri, 16 Sep 2022 00:14:36 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: TCP fallback open questions

I'm beginning to work on the long-awaited TCP fallback path in musl's stub resolver, and have here a list of things I've found so far that need some decisions about specifics of the behavior. For the most part these "questions" already have a seemingly solid choice I'm leaning towards, but since this is a topic that's complex and that's had lots of dissatisfaction over it in the past, I want to put them out to the community for feedback as soon as possible so that any major concerns can be considered.

1. How to switch over in the middle of a (multi-)query:

In general, the DNS query core is performing M (1 or 2) concurrent queries against N configured nameservers, all in parallel (N*M in-flight operations). Any one of those might get back a truncated response requiring fallback to TCP. We want to do this fallback in a way that minimizes additional latency, resource usage, and logic complexity, which may be conflicting goals.

As policy, we already assume the configured nameservers are redundant, providing non-conflicting views of the same DNS namespace. So once we see a single reply for a given question that requires TCP fallback, it's reasonable to conclude that any other reply to that question (from the other nameservers) would also require fallback, and to stop further retries of that question over UDP and ignore further answers to that question over UDP.

The other question(s, but really only at most one, the opposite A/AAAA) however may still have satisfactory UDP answers, so ideally we want to keep listening for those, and keep retrying them.

In principle we could end up using N*M TCP sockets for an exhaustive parallel query. N and M are small enough that this isn't huge, but it's also not really nice.

Given that the switch to TCP was triggered by a truncated UDP response, we already know that the responding server *knows the answer* and just can't send it within the size limits. So a reasonable course of action is just to open a TCP connection to the nameserver that issued the truncated response. This is not necessarily going to be optimal -- it's possible that another nameserver has gotten a response in the meantime and that the round-trip for TCP handshake and payload would be much lower to that other server. But I'm doubtful that consuming extra kernel resources and producing extra network load to optimize the latency here is a reasonable tradeoff.

I'm assuming so far that each question at least would have its own TCP connection (if truncated as UDP). Using multiple nameservers in parallel with TCP would maybe be an option if we were doing multiple queries on the same connection, but I'm not aware of whether TCP DNS has any sort of "pipelining" that would make this perform reasonably. Maybe if "priming" the question via UDP it doesn't matter though and we could expect the queries to be processed immediately with cached results? I don't think I like this but I'm just raising it for completeness.

TL;DR summary: my leaning is to do one TCP connection per question that needs fallback, to the nameserver that issued the truncated response for the question. Does this seem reasonable? Am I overlooking anything important?

2. Buffer shuffling:

Presently, UDP reads take place into the first unanswered buffer slot and then get moved if it wasn't the right place. This does not seem like it will work well when there are potentially partial TCP reads taking place into one or more slots. I think the simplest solution is just to use an additional fixed-size 512-byte local buffer in __res_msend_rc for receiving UDP and always move it into the right slot afterwards.
The increased stack usage is not wonderful, but rather small relative to the whole call stack, and probably worth it to avoid code complexity. It also gives us a place to perform throwaway TCP reads into, when reading responses longer than the destination buffer just to report the needed length to the caller.

3. Timeouts:

UDP being datagram based, there is no condition where we have to worry about blocking and getting stuck in the middle of a partial read. Timeout occurs just at the loop level. Are there any special considerations for timeout here using TCP? My leaning is no, since we'll still be in a poll loop regime, and regardless of blocking state on the socket, recv should do partial reads in the absence of MSG_WAITALL.

4. Logic for when fallback is needed:

As noted in the thread "res_query/res_send contract findings", fallback is always needed by these functions when they get a response with the TC bit set because of the contract to return the size needed for the complete answer. But for high level (getaddrinfo, etc.) lookups, it's desirable to use truncated answers when we can. What should the condition for "when we can" be? My first leaning was that "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the truncated response contains only the CNAME RR, not any records from the A or AAAA RRset.

Some possible conditions that could be used:

- At least one RR of the type in the question. This seems to be the choice to make maximal use of truncated responses, but could give significantly fewer addresses than one might like if the nameserver is badly behaved or if there's a very large CNAME consuming most of the packet.

- No CNAME and packet size is at least 512 minus the size of one RR. This goes maximally in the other direction, never using results that might be limited by the presence of a CNAME, and ensuring we always have the number of answers we'd keep from a TCP response.
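
For concreteness, the first condition in the list above (at least one RR of the type in the question) could be checked roughly as follows. This is a minimal sketch and not musl code: the names skip_name and count_matching_answers are hypothetical, it assumes the packet carries exactly one question, and it simply stops counting at the first record that does not fit in the buffer.

```c
#include <stddef.h>

/* Hypothetical helper: skip one encoded domain name starting at offset i.
 * Returns the offset just past the name, or 0 if the name is malformed
 * or runs off the end of the packet. */
static size_t skip_name(const unsigned char *p, size_t len, size_t i)
{
    while (i < len) {
        unsigned c = p[i];
        if (c == 0) return i + 1;            /* root label ends the name */
        if ((c & 0xc0) == 0xc0)              /* 2-byte compression pointer ends it too */
            return i + 2 <= len ? i + 2 : 0;
        if (c & 0xc0) return 0;              /* reserved label types */
        i += 1 + c;                          /* ordinary label: length byte + data */
    }
    return 0;
}

/* Hypothetical helper: count answer RRs whose TYPE equals the QTYPE of the
 * (single) question. Stores the TC bit in *tc. Returns -1 if the header or
 * question section cannot be parsed. */
int count_matching_answers(const unsigned char *p, size_t len, int *tc)
{
    if (len < 12) return -1;                 /* DNS header is 12 bytes */
    *tc = !!(p[2] & 0x02);                   /* TC flag */
    unsigned qdcount = p[4] << 8 | p[5];
    unsigned ancount = p[6] << 8 | p[7];
    if (qdcount != 1) return -1;             /* assumption: one question */

    size_t i = skip_name(p, len, 12);
    if (!i || i + 4 > len) return -1;        /* QTYPE + QCLASS must fit */
    unsigned qtype = p[i] << 8 | p[i + 1];
    i += 4;

    int n = 0;
    for (unsigned k = 0; k < ancount; k++) {
        i = skip_name(p, len, i);
        if (!i || i + 10 > len) break;       /* TYPE, CLASS, TTL, RDLENGTH */
        unsigned type = p[i] << 8 | p[i + 1];
        size_t rdlen = p[i + 8] << 8 | p[i + 9];
        i += 10;
        if (rdlen > len - i) break;          /* RDATA cut off: stop here */
        i += rdlen;
        if (type == qtype) n++;
    }
    return n;
}
```

Under this condition, a caller would keep a truncated reply when tc is set and the return value is positive, and fall back to TCP otherwise. A reply holding only a CNAME yields 0 and so triggers the fallback.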

There are probably several other reasonable options on a spectrum between these two.

Unless name_from_dns (lookup_name.c) is changed to use longer response buffers, the only case in which switching to TCP will give us a better answer is when the nameserver is being petulant in its truncation. But it probably should be changed, since the case where the entire packet is consumed by a CNAME can be hit. To avoid that, the buffer needs to be at least just under 600 bytes.
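
To illustrate the partial-read concern from items 2 and 3: DNS over TCP prefixes each message with a two-byte big-endian length, so a reader driven by a poll loop has to carry state across calls. The following is a rough sketch under stated assumptions, not the planned musl implementation: struct tcp_rx and tcp_rx_step are hypothetical names, the function is assumed to be called only after poll() reports the socket readable, and bytes beyond the destination buffer are read into a throwaway scratch buffer so the full length can still be reported.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical per-connection receive state; zero-initialize before use. */
struct tcp_rx {
    unsigned char lenbuf[2]; /* the 2-byte big-endian length prefix */
    size_t got;              /* bytes consumed so far, prefix included */
    size_t msglen;           /* message length, valid once got >= 2 */
};

/* Perform at most one recv() without MSG_WAITALL, so it never waits for
 * more than is already available. Returns 1 when the whole message has
 * been consumed, 0 when more data is needed, -1 on error or EOF. */
int tcp_rx_step(int fd, struct tcp_rx *s, unsigned char *dst, size_t dstsize)
{
    unsigned char scratch[512]; /* throwaway space for bytes past dstsize */
    unsigned char *p;
    size_t want;
    ssize_t r;

    if (s->got < 2) {
        /* Still reading the length prefix, possibly one byte at a time. */
        p = s->lenbuf + s->got;
        want = 2 - s->got;
    } else {
        size_t pos = s->got - 2;         /* offset within the message */
        want = s->msglen - pos;
        if (pos < dstsize) {
            p = dst + pos;
            if (want > dstsize - pos) want = dstsize - pos;
        } else {
            p = scratch;
            if (want > sizeof scratch) want = sizeof scratch;
        }
    }

    r = recv(fd, p, want, 0);
    if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
        return 0;                        /* no progress; poll again */
    if (r <= 0) return -1;               /* error or connection closed */

    int had_len = s->got >= 2;
    s->got += r;
    if (!had_len) {
        if (s->got < 2) return 0;
        s->msglen = s->lenbuf[0] << 8 | s->lenbuf[1];
    }
    return s->got - 2 == s->msglen;
}
```

The timeout stays at the level of the surrounding poll loop, as with UDP. Once the function returns 1, msglen holds the size needed for the complete answer even if only dstsize bytes were stored.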