Date: Tue, 30 Apr 2013 11:40:45 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Word-sized reads access memory past the bound of objects

On Tue, Apr 30, 2013 at 05:11:14PM +0200, Jonas Wagner wrote:
> Hi,
>
> I'm currently experimenting with MUSL and automated bug finding tools. One
> issue I'm facing is that the tool reports several errors in functions such
> as strlen, that perform word-size accesses. What happens is that strlen
> reads a word at a time, then checks whether there is a zero in there. If
> the zero happens to be in the first byte, it thus reads three bytes past
> the end of the string.
>
> In principle, the tool is correct and MUSL does cause undefined behavior

Yes and no. The "underlying freestanding implementation" musl assumes
and is built on represents all of mapped memory as arrays in page-size
units, with mapping properties/permissions at page granularity.
However, testing and analysis tools might offer a more restrictive
underlying model.

> here. In practice, I don't see a way how MUSL's behavior could cause any
> damage...

Read-only accesses aligned to the size of the access, and where the
initial byte is accessible, can never fault under the assumed memory
model.

> My questions are:
> - How prevalent is such code in MUSL?

Not very. Probably src/string and src/multibyte are the only places.

> - Would there be an easy way to find all these places and change them?

The tool you're using is probably the best way. Or, any static
analysis that can detect conversions (even indirect) from character
pointer types to a pointer to a non-character type.

> - Are there other types of "soft" undefined behavior that MUSL exploits?

I don't think so. The closest things I can think of:

- UTF-8 code depends on sign-extending right-shift. This could be
  easily fixed if it can be verified that the standard trick to work
  around it generates the same (or equally efficient) code. Note this
  is implementation-defined, not undefined.

- Floating point conversion to/from strings depends on IEEE
  arithmetic properties and on long double being an IEEE conforming
  type. (x87 ld80 is fine, so is IEEE quad, but IBM double-double will
  not work, and systems that typically use IBM double-double should
  have their compiler configured for 64-bit long double instead.)

- calloc assumes its own implementation of malloc. Compilers and
  analysis tools which assume negative offsets from the pointer
  returned by malloc are invalid will falsely detect problems and/or
  miscompile calloc.c. This issue affected old versions of clang.

- The dynamic linker also makes some assumptions about the
  implementation of malloc and passes pointers not obtained by malloc
  to free, as part of its mechanism to reclaim wasted slack space in
  shared libraries due to page alignment.

- POSIX timers with SIGEV_THREAD perform a longjmp out of a
  cancellation handler to intercept cancellation/exit so the same
  physical thread can be kept to handle the next timer expiration. For
  an application to do this would be UB (at the POSIX level, not the C
  level) but since they're both part of the same implementation they
  can assume things about each other.

That's all that comes to mind right now. Thanks for bringing up this
question, because it's something that should be documented in case
people want to reuse parts of musl in contexts where some of the
assumptions may no longer be valid.

> I guess changing MUSL would lose a lot of performance... so maybe
> I'll adapt the bug finding tool instead...

Maybe. With a compiler that can do vectorization and a machine with
vector instructions, the "naive" versions of these functions can be
just as fast in practice, and perhaps even faster in theory. The big
problem is that gcc won't vectorize 4 byte accesses into a 32-bit word
in a normal 32-bit register, even though it could...

Maybe in the long term this won't matter if we have asm for the
important archs without vector ops...?

Rich
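
For readers following the thread, the word-at-a-time technique under
discussion can be sketched roughly as below. This is an illustrative
sketch, not musl's actual source: the names has_zero, word_strlen and
naive_strlen are made up for this example, and it loads each word with
memcpy (an assumption made here to sidestep pointer-aliasing concerns)
rather than through a cast pointer. It scans byte by byte until the
pointer is word-aligned, then tests a whole word at once for a zero
byte. Because every word load is aligned to its own size, it cannot
straddle a page boundary, which is the property the reply relies on.
The load can still extend past the terminating null byte, which is
exactly what a strict analysis tool reports.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  ((size_t)-1 / UCHAR_MAX)        /* 0x0101...01 */
#define HIGHS (ONES * (UCHAR_MAX / 2 + 1))    /* 0x8080...80 */

/* Nonzero if and only if at least one byte of x is zero. */
static int has_zero(size_t x)
{
    return ((x - ONES) & ~x & HIGHS) != 0;
}

/* Word-at-a-time string length (hypothetical name, for illustration). */
size_t word_strlen(const char *s)
{
    const char *p = s;

    /* Byte-wise scan until p is aligned to the word size. */
    for (; (uintptr_t)p % sizeof(size_t); p++)
        if (!*p) return (size_t)(p - s);

    /* Aligned word-wise scan. Each load may read bytes beyond the
     * terminator, but never beyond the page holding the first byte. */
    for (;;) {
        size_t w;
        assert((uintptr_t)p % sizeof(size_t) == 0);
        memcpy(&w, p, sizeof w);
        if (has_zero(w)) break;
        p += sizeof w;
    }

    /* Locate the exact zero byte inside the final word. */
    for (; *p; p++);
    return (size_t)(p - s);
}

/* The "naive" version: one byte per iteration, never reads past
 * the terminator. */
size_t naive_strlen(const char *s)
{
    const char *p = s;
    while (*p) p++;
    return (size_t)(p - s);
}
```

Both functions return the same result for any valid string. Only
word_strlen can touch bytes after the terminator, so a tool that
models objects at byte granularity will flag it while accepting
naive_strlen.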