Date: Mon, 23 Feb 2015 20:09:52 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Cc: Denys Vlasenko <vda.linux@...glemail.com>
Subject: Draft of improved memset.s for i386

Here's a draft of an improved i386 memset.s based on the principles
Denys Vlasenko and I discussed for his and my x86_64 versions. Compared
to the current code, it reduces entry/exit overhead, increases the
length supported in the non-rep-stosl path, and aligns the rep-stosl.

My tests don't measure the misalignment penalty, but even in the
aligned case the rep-stosl path is slightly faster (~5 cycles per run,
out of at least 64 cycles) and the non-rep-stosl path is significantly
faster (e.g. 33 vs 51 cycles at size 16 and 40 vs 57 at size 32).

Empirically, the byte-register-access/left-shift method of extending
the fill value to a word performs better than imul for me, but the
margin is very small (at most 1 cycle). Since we support much older
CPUs (like an actual 486) where imul could be really slow, I think this
is the right approach in principle too. I used imul in the rep-stosl
path but haven't tested whether it's faster there.

The non-rep-stosl path only goes up to size 62. I think sizes up to
126 could benefit from it, but the string of stores was getting really
long.

Correctness has not been tested, so there may be stupid bugs.

Rich

View attachment "memset-draft.s" of type "text/plain" (1092 bytes)
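
For readers without the attachment, the two ways of extending the fill byte to a 32-bit word mentioned above, and the idea of aligning the destination before the bulk word fill, can be sketched in C. This is only an illustration under stated assumptions, not the contents of memset-draft.s: the names splat_mul, splat_shift and sketch_memset are hypothetical, and a compiler is free to turn either splat variant into whatever instructions it likes, so the sketch says nothing about the cycle counts quoted in the message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply method: c * 0x01010101 copies the byte into all four
 * byte lanes of the word (what a single imul would do). */
static uint32_t splat_mul(unsigned char c)
{
    return (uint32_t)c * 0x01010101u;
}

/* Shift method: build the same word with shifts and ORs only,
 * avoiding a multiply on CPUs where it may be slow. */
static uint32_t splat_shift(unsigned char c)
{
    uint32_t w = c;
    w |= w << 8;   /* low 16 bits now hold the byte twice */
    w |= w << 16;  /* all 32 bits now hold the byte four times */
    return w;
}

/* Hypothetical memset sketch: byte stores until the pointer is
 * 4-byte aligned, then whole-word stores (the role rep stosl plays
 * in assembly), then a byte tail. */
static void *sketch_memset(void *dest, int c, size_t n)
{
    unsigned char *p = dest;
    uint32_t w = splat_shift((unsigned char)c);

    /* Head: advance to a 4-byte boundary. */
    while (n && ((uintptr_t)p & 3)) {
        *p++ = (unsigned char)c;
        n--;
    }

    /* Body: aligned 32-bit stores; memcpy keeps the store
     * well-defined under C aliasing rules. */
    for (; n >= 4; n -= 4, p += 4)
        memcpy(p, &w, sizeof w);

    /* Tail: remaining 0 to 3 bytes. */
    while (n--)
        *p++ = (unsigned char)c;

    return dest;
}
```

Both splat functions return the same word for any byte value, so the choice between them is purely one of cost on the target CPU. Calling sketch_memset(buf + 1, 0x5A, 37) on a zeroed buffer exercises the misaligned head, the word body and the byte tail in one call.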