musl - [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove

Message-Id: <20230607100710.4286-1-zhang_fei_0403@163.com>
Date: Wed,  7 Jun 2023 18:07:07 +0800
From: zhangfei <zhang_fei_0403@....com>
To: dalias@...c.org,
	musl@...ts.openwall.com
Cc: zhangfei <zhangfei@...iscas.ac.cn>
Subject: [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove

From: zhangfei <zhangfei@...iscas.ac.cn>

Hi,

Currently, the RISC-V architecture code in the Linux kernel uses assembly
implementations of memset, memcpy, and memmove, as shown in the links below:

[1] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memset.S
[2] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memcpy.S
[3] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memmove.S

I have adapted them so that they can be compiled in musl. I also noticed
that aarch64 and x86 in musl have assembly implementations of these
functions, so I hope these patches can be integrated into musl.

For sizes of less than 8 bytes, memset.S follows the approach used in
musl/src/string/memset.c: the byte stores are arranged to fill the head and
tail with minimal branching.
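
As an illustration of the head/tail fill described above, here is a minimal C
sketch modeled on the first branches of memset.c in musl, which cover sizes up
to 8 bytes. The name small_fill is hypothetical and exists only for this
example; it is not the code in memset.S.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only: fill n <= 8 bytes.
 * Stores are issued from both ends of the buffer, so a few size
 * checks cover every length. Some bytes are written twice, which
 * is harmless and cheaper than a per-byte loop with a branch. */
static void *small_fill(void *dest, int c, size_t n)
{
    unsigned char *s = dest;

    assert(n <= 8);          /* larger sizes need the word loop */

    if (!n) return dest;
    s[0] = c;
    s[n-1] = c;
    if (n <= 2) return dest;
    s[1] = c;
    s[2] = c;
    s[n-2] = c;
    s[n-3] = c;
    if (n <= 6) return dest;
    s[3] = c;
    s[n-4] = c;
    return dest;
}
```

Each check guarantees that the offsets used after it stay inside the
destination region, so no store can land outside the buffer.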

The original memcpy.S in the kernel falls back to a byte-wise copy if src and
dst are not co-aligned. This approach is not efficient enough, so the patches
linked below were used to optimize the kernel's memcpy.S.

[4] https://lore.kernel.org/all/20210216225555.4976-1-gary@garyguo.net/
[5] https://lore.kernel.org/all/20210513084618.2161331-1-bmeng.cn@gmail.com/
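
A common technique for copying when src and dst are not co-aligned is to load
aligned words from the source and merge neighbouring words with shifts, so
that every store to the destination is a full aligned word. The C sketch below
illustrates that general idea only. It is not the code from the patches above.
The name copy_shifted, the word-array interface, and the little-endian byte
order are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only (little-endian layout).
 * src_words is the aligned word array that contains the source data;
 * the data itself starts `off` bytes (1..7) into src_words[0].
 * Writes nwords words to dst and reads nwords + 1 words from
 * src_words. */
static void copy_shifted(uint64_t *dst, const uint64_t *src_words,
                         unsigned off, size_t nwords)
{
    assert(off > 0 && off < 8);   /* off == 0 is the co-aligned case */

    unsigned rshift = 8 * off;    /* drops bytes before the data */
    unsigned lshift = 64 - rshift; /* brings in bytes from the next word */
    uint64_t cur = src_words[0];

    for (size_t i = 0; i < nwords; i++) {
        uint64_t next = src_words[i + 1];
        /* low part comes from cur, high part from next */
        dst[i] = (cur >> rshift) | (next << lshift);
        cur = next;
    }
}
```

Each source word is loaded once and reused for two output words, so the loop
issues one load and one store per word instead of eight byte loads and eight
byte stores.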

memmove.S needed few modifications: it was only made independent of the
kernel's header files so that it can be compiled separately in musl.

The test platform was a RISC-V SiFive U74. I used the code linked below
for performance testing.

[6] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/
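
For readers unfamiliar with the bytes/ns unit used in the results, the sketch
below shows one way such a figure can be measured. It is a simplified
illustration, not the code of the benchmark linked above. The function name
memset_bytes_per_ns and its parameters are hypothetical, and it assumes a
POSIX system that provides clock_gettime with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper for illustration only: call memset `iters`
 * times on a buffer of `size` bytes and return the throughput in
 * bytes per nanosecond. */
static double memset_bytes_per_ns(unsigned char *buf, size_t size,
                                  size_t iters)
{
    /* Calling through a volatile pointer keeps the compiler from
     * removing or merging the calls. */
    void *(*volatile fn)(void *, int, size_t) = memset;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        fn(buf, (int)(i & 0xff), size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    if (ns <= 0.0)
        return 0.0;               /* timer too coarse for this run */
    return (double)size * (double)iters / ns;
}
```

A higher value means more bytes written per unit of time, so larger numbers
in the tables are better.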

I compared the performance of the C implementations in musl with that of the
assembly implementations. The test results are as follows:

memset.c in musl:
---------------------
Random memset (bytes/ns):
           memset_call 32K: 0.36 64K: 0.29 128K: 0.25 256K: 0.23 512K: 0.22 1024K: 0.21 avg 0.25

Medium memset (bytes/ns):
           memset_call 8B: 0.28 16B: 0.30 32B: 0.48 64B: 0.86 128B: 1.55 256B: 2.60 512B: 3.72

Large memset (bytes/ns):
           memset_call 1K: 4.83 2K: 5.40 4K: 5.85 8K: 6.09 16K: 6.22 32K: 6.15 64K: 1.39

memset.S:
---------------------
Random memset (bytes/ns):
           memset_call 32K: 0.46 64K: 0.35 128K: 0.30 256K: 0.28 512K: 0.27 1024K: 0.25 avg 0.31

Medium memset (bytes/ns):
           memset_call 8B: 0.27 16B: 0.48 32B: 0.91 64B: 1.63 128B: 2.71 256B: 4.40 512B: 5.67

Large memset (bytes/ns):
           memset_call 1K: 6.62 2K: 7.03 4K: 7.46 8K: 7.71 16K: 7.83 32K: 7.57 64K: 1.39


memcpy.c in musl:
---------------------
Random memcpy (bytes/ns):
           memcpy_call 32K: 0.24 64K: 0.20 128K: 0.18 256K: 0.17 512K: 0.16 1024K: 0.15 avg 0.18

Aligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.18 16B: 0.31 32B: 0.50 64B: 0.72 128B: 0.94 256B: 1.10 512B: 1.19

Unaligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.12 16B: 0.17 32B: 0.23 64B: 0.47 128B: 0.65 256B: 0.79 512B: 0.91

Large memcpy (bytes/ns):
           memcpy_call 1K: 1.25 2K: 1.29 4K: 1.31 8K: 1.31 16K: 1.28 32K: 0.62 64K: 0.56

memcpy.S:
---------------------
Random memcpy (bytes/ns):
           memcpy_call 32K: 0.29 64K: 0.24 128K: 0.21 256K: 0.20 512K: 0.20 1024K: 0.17 avg 0.21

Aligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.15 16B: 0.56 32B: 0.91 64B: 1.17 128B: 2.36 256B: 2.90 512B: 3.27

Unaligned medium memcpy (bytes/ns):
           memcpy_call 8B: 0.15 16B: 0.27 32B: 0.45 64B: 0.67 128B: 0.90 256B: 1.03 512B: 1.16

Large memcpy (bytes/ns):
           memcpy_call 1K: 3.49 2K: 3.55 4K: 3.65 8K: 3.69 16K: 3.54 32K: 0.87 64K: 0.75


memmove.c in musl:
---------------------
Unaligned forwards memmove (bytes/ns):
               memmove 1K: 0.22 2K: 0.22 4K: 0.22 8K: 0.23 16K: 0.23 32K: 0.22 64K: 0.20

Unaligned backwards memmove (bytes/ns):
               memmove 1K: 0.28 2K: 0.28 4K: 0.28 8K: 0.28 16K: 0.28 32K: 0.28 64K: 0.24

memmove.S:
---------------------
Unaligned forwards memmove (bytes/ns):
               memmove 1K: 1.74 2K: 1.85 4K: 1.89 8K: 1.91 16K: 1.92 32K: 1.83 64K: 0.81

Unaligned backwards memmove (bytes/ns):
               memmove 1K: 1.70 2K: 1.81 4K: 1.87 8K: 1.89 16K: 1.91 32K: 1.84 64K: 0.83

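The results show that the basic-instruction assembly implementations of
memset, memcpy, and memmove outperform the C implementations in musl at almost
all tested sizes. Only the 8-byte memset and 8-byte aligned memcpy cases are
slightly slower, and the 64K large memset is unchanged. Please review the code.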

Thanks,
Zhang Fei

zhangfei (3):
  RISC-V: Optimize memset
  RISC-V: Optimize memcpy
  RISC-V: Optimize memmove

 src/string/riscv64/memset.S  | 136 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memcpy.S  | 159 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memmove.S | 315 +++++++++++++++++++++++++++++++++++
 3 files changed, 610 insertions(+)
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memcpy.S
 create mode 100644 src/string/riscv64/memmove.S