Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20251023155353.175632-2-pincheng.plct@isrc.iscas.ac.cn>
Date: Thu, 23 Oct 2025 23:53:53 +0800
From: Pincheng Wang <pincheng.plct@...c.iscas.ac.cn>
To: musl@...ts.openwall.com
Cc: pincheng.plct@...c.iscas.ac.cn
Subject: [PATCH v2 resend 1/1] riscv64: add runtime-detected vector optimized memset

Add a RISC-V vector extension optimized memset implementation with
runtime CPU capability detection via HW_CAP.

The implementation provides both vector and scalar variants in a single
binary. At process startup, __init_riscv_string_optimizations() queries
AT_HWCAP to detect RVV support and selects the appropriate
implementation via function pointer dispatch. This allows the same libc
to run correctly on both vector-capable and non-vector RISC-V CPUs.

The vector implementation uses vsetvli for dynamic vector length and
employs a head-tail filling strategy for small sizes to minimize
overhead.

To prevent illegal instruction error, arch.mak disables compiler
auto-vectorization globally except for memset.S, ensuring only the
runtime-detected code uses vector instructions.

Signed-off-by: Pincheng Wang <pincheng.plct@...c.iscas.ac.cn>
---
 0000-cover-letter.patch                       |  79 ++++++
 0000-cover-letter.patch.bak                   |  79 ++++++
 ...ime-detected-vector-optimized-memset.patch | 259 ++++++++++++++++++
 arch/riscv64/arch.mak                         |  12 +
 src/env/__libc_start_main.c                   |   3 +
 src/internal/libc.h                           |   1 +
 src/string/riscv64/memset.S                   | 134 +++++++++
 src/string/riscv64/memset_dispatch.c          |  38 +++
 test_memset                                   | Bin 0 -> 8824 bytes
 9 files changed, 605 insertions(+)
 create mode 100644 0000-cover-letter.patch
 create mode 100644 0000-cover-letter.patch.bak
 create mode 100644 0001-riscv64-add-runtime-detected-vector-optimized-memset.patch
 create mode 100644 arch/riscv64/arch.mak
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memset_dispatch.c
 create mode 100755 test_memset

diff --git a/0000-cover-letter.patch b/0000-cover-letter.patch
new file mode 100644
index 00000000..83ebe09e
--- /dev/null
+++ b/0000-cover-letter.patch
@@ -0,0 +1,79 @@
+From f54ffd5fabd469a4dc4a6631b497d58cf5663cf8 Mon Sep 17 00:00:00 2001
+From: Pincheng Wang <pincheng.plct@...c.iscas.ac.cn>
+Date: Thu, 23 Oct 2025 21:15:28 +0800
+Subject: [PATCH v2 0/1] riscv64: Add RVV optimized memset implementation
+
+Hi all,
+
+This is v2 of the RISC-V Vector (RVV) optimized memset patch.
+
+Changes from v1:
+- Replaced compile-time detetion (__riscv_vector macro) with runtime
+  detection using AT_HWCAP, addressing the main concern from v1
+  feedback.
+- Introduced a dispatch mechanism (memset_dispatch.c) that selects
+  appropriate implementation at process startup.
+- Added arch.mak configuration to prevent GCC auto-vectorization on
+  other string functions, ensuring only our runtime-detected code uses
+  vector insturctions.
+- Single binary now works correctly on both RVV and non-RVV hardware
+  when built with CFLAGS+="-march=rv64gcv".
+
+Implementation details:
+- memset.S provides two symbols: memset_vect (RVV) and memset_scalar.
+- memset_dispatch.c exports memset() which dispatches via function
+  pointer.
+- __init_riscv_string_optimizations() is called in __libc_start_main to
+  initialize the function pointer based on AT_HWCAP.
+- The vector implementation uses vsetvli for bulk fills and a head-tail
+  strategy for small sizes.
+
+Performance (unchanged from v1):
+- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
+  across medium and large sizes.
+- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
+  larger sizes.
+
+For very small sizes (<8 bytes), there can be minor regressions compared
+to the generic C version. This is a trade off for the significant gains
+on larger sizes.
+
+Additional Notes:
+It was mentioned during internal discussions that, according to the XuanTie C908 user
+manuals [1], vector memory operations must not target addresses with
+Strong Order (SO) attributes, otherwise the system may crash. In the
+current implementation, this scenario is handled by OpenSBI fallback
+mechanisms, but this leads to degraded performance compared to scalar
+implementations.
+I reviewed other existing vectorized mem* patches and did not find
+explicit handling for this case. Introducing explicit attribute checks
+would likely add extra overhead. Therefore, I am currently uncertain
+whether special handling for XuanTie CPUs should be included in this
+patch or addressed separately.
+
+Testing:
+- QEMU with QEMU_CPU="rv64,v=true" and "rv64,v=false".
+- Spacemit X60 with V extension support.
+- XuanTie C908 with V extension support.
+- CFLAGS += "-march=rv64gcv" and "-march=rv64gc".
+Functional behavior matches generic memset.
+
+Thanks,
+Pincheng Wang
+
+Pincheng Wang (1):
+  riscv64: add runtime-detected vector optimized memset
+
+ arch/riscv64/arch.mak                |  12 +++
+ src/env/__libc_start_main.c          |   3 +
+ src/internal/libc.h                  |   1 +
+ src/string/riscv64/memset.S          | 134 +++++++++++++++++++++++++++
+ src/string/riscv64/memset_dispatch.c |  32 +++++++
+ 5 files changed, 182 insertions(+)
+ create mode 100644 arch/riscv64/arch.mak
+ create mode 100644 src/string/riscv64/memset.S
+ create mode 100644 src/string/riscv64/memset_dispatch.c
+
+-- 
+2.39.5
+
diff --git a/0000-cover-letter.patch.bak b/0000-cover-letter.patch.bak
new file mode 100644
index 00000000..83ebe09e
--- /dev/null
+++ b/0000-cover-letter.patch.bak
@@ -0,0 +1,79 @@
+From f54ffd5fabd469a4dc4a6631b497d58cf5663cf8 Mon Sep 17 00:00:00 2001
+From: Pincheng Wang <pincheng.plct@...c.iscas.ac.cn>
+Date: Thu, 23 Oct 2025 21:15:28 +0800
+Subject: [PATCH v2 0/1] riscv64: Add RVV optimized memset implementation
+
+Hi all,
+
+This is v2 of the RISC-V Vector (RVV) optimized memset patch.
+
+Changes from v1:
+- Replaced compile-time detetion (__riscv_vector macro) with runtime
+  detection using AT_HWCAP, addressing the main concern from v1
+  feedback.
+- Introduced a dispatch mechanism (memset_dispatch.c) that selects
+  appropriate implementation at process startup.
+- Added arch.mak configuration to prevent GCC auto-vectorization on
+  other string functions, ensuring only our runtime-detected code uses
+  vector insturctions.
+- Single binary now works correctly on both RVV and non-RVV hardware
+  when built with CFLAGS+="-march=rv64gcv".
+
+Implementation details:
+- memset.S provides two symbols: memset_vect (RVV) and memset_scalar.
+- memset_dispatch.c exports memset() which dispatches via function
+  pointer.
+- __init_riscv_string_optimizations() is called in __libc_start_main to
+  initialize the function pointer based on AT_HWCAP.
+- The vector implementation uses vsetvli for bulk fills and a head-tail
+  strategy for small sizes.
+
+Performance (unchanged from v1):
+- On Spacemit X60: up to ~3.1x faster (256B), with consistent gains
+  across medium and large sizes.
+- On XuanTie C908: up to ~2.1x faster (128B), with modest gains for
+  larger sizes.
+
+For very small sizes (<8 bytes), there can be minor regressions compared
+to the generic C version. This is a trade off for the significant gains
+on larger sizes.
+
+Additional Notes:
+It was mentioned during internal discussions that, according to the XuanTie C908 user
+manuals [1], vector memory operations must not target addresses with
+Strong Order (SO) attributes, otherwise the system may crash. In the
+current implementation, this scenario is handled by OpenSBI fallback
+mechanisms, but this leads to degraded performance compared to scalar
+implementations.
+I reviewed other existing vectorized mem* patches and did not find
+explicit handling for this case. Introducing explicit attribute checks
+would likely add extra overhead. Therefore, I am currently uncertain
+whether special handling for XuanTie CPUs should be included in this
+patch or addressed separately.
+
+Testing:
+- QEMU with QEMU_CPU="rv64,v=true" and "rv64,v=false".
+- Spacemit X60 with V extension support.
+- XuanTie C908 with V extension support.
+- CFLAGS += "-march=rv64gcv" and "-march=rv64gc".
+Functional behavior matches generic memset.
+
+Thanks,
+Pincheng Wang
+
+Pincheng Wang (1):
+  riscv64: add runtime-detected vector optimized memset
+
+ arch/riscv64/arch.mak                |  12 +++
+ src/env/__libc_start_main.c          |   3 +
+ src/internal/libc.h                  |   1 +
+ src/string/riscv64/memset.S          | 134 +++++++++++++++++++++++++++
+ src/string/riscv64/memset_dispatch.c |  32 +++++++
+ 5 files changed, 182 insertions(+)
+ create mode 100644 arch/riscv64/arch.mak
+ create mode 100644 src/string/riscv64/memset.S
+ create mode 100644 src/string/riscv64/memset_dispatch.c
+
+-- 
+2.39.5
+
diff --git a/0001-riscv64-add-runtime-detected-vector-optimized-memset.patch b/0001-riscv64-add-runtime-detected-vector-optimized-memset.patch
new file mode 100644
index 00000000..c0596073
--- /dev/null
+++ b/0001-riscv64-add-runtime-detected-vector-optimized-memset.patch
@@ -0,0 +1,259 @@
+From f54ffd5fabd469a4dc4a6631b497d58cf5663cf8 Mon Sep 17 00:00:00 2001
+From: Pincheng Wang <pincheng.plct@...c.iscas.ac.cn>
+Date: Thu, 23 Oct 2025 21:15:00 +0800
+Subject: [PATCH v2 1/1] riscv64: add runtime-detected vector optimized memset
+
+Add a RISC-V vector extension optimized memset implementation with
+runtime CPU capability detection via HW_CAP.
+
+The implementation provides both vector and scalar variants in a single
+binary. At process startup, __init_riscv_string_optimizations() queries
+AT_HWCAP to detect RVV support and selects the appropriate
+implementation via function pointer dispatch. This allows the same libc
+to run correctly on both vector-capable and non-vector RISC-V CPUs.
+
+The vector implementation uses vsetvli for dynamic vector length and
+employs a head-tail filling strategy for small sizes to minimize
+overhead.
+
+To prevent illegal instruction error, arch.mak disables compiler
+auto-vectorization globally except for memset.S, ensuring only the
+runtime-detected code uses vector instructions.
+
+Signed-off-by: Pincheng Wang <pincheng.plct@...c.iscas.ac.cn>
+---
+ arch/riscv64/arch.mak                |  12 +++
+ src/env/__libc_start_main.c          |   3 +
+ src/internal/libc.h                  |   1 +
+ src/string/riscv64/memset.S          | 134 +++++++++++++++++++++++++++
+ src/string/riscv64/memset_dispatch.c |  32 +++++++
+ 5 files changed, 182 insertions(+)
+ create mode 100644 arch/riscv64/arch.mak
+ create mode 100644 src/string/riscv64/memset.S
+ create mode 100644 src/string/riscv64/memset_dispatch.c
+
+diff --git a/arch/riscv64/arch.mak b/arch/riscv64/arch.mak
+new file mode 100644
+index 00000000..5978eb0a
+--- /dev/null
++++ b/arch/riscv64/arch.mak
+@@ -0,0 +1,12 @@
++# Disable tree vectorization for all files except memset.S
++
++# Reason: We have hand-optimized vector memset.S that uses runtime detection
++# to switch between scalar and vector implementations based on CPU capability.
++# However, GCC may auto-vectorize other functions (like memcpy, strcpy, etc.)
++# which would cause illegal instruction errors on CPUs without vector extensions.
++
++# Therefore, we disable auto-vectorization for all files except memset.S,
++# ensuring only our runtime-detected vector code uses vector instructions.
++
++# Add -fno-tree-vectorize to all object files except memset.S
++$(filter-out obj/src/string/riscv64/memset.o obj/src/string/riscv64/memset.lo, $(ALL_OBJS) $(LOBJS)): CFLAGS_ALL += -fno-tree-vectorize
+diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
+index c5b277bd..c23e63f4 100644
+--- a/src/env/__libc_start_main.c
++++ b/src/env/__libc_start_main.c
+@@ -38,6 +38,9 @@ void __init_libc(char **envp, char *pn)
+ 
+ 	__init_tls(aux);
+ 	__init_ssp((void *)aux[AT_RANDOM]);
++#ifdef __riscv
++	__init_riscv_string_optimizations();
++#endif
+ 
+ 	if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
+ 		&& !aux[AT_SECURE]) return;
+diff --git a/src/internal/libc.h b/src/internal/libc.h
+index 619bba86..28e893a1 100644
+--- a/src/internal/libc.h
++++ b/src/internal/libc.h
+@@ -40,6 +40,7 @@ extern hidden struct __libc __libc;
+ hidden void __init_libc(char **, char *);
+ hidden void __init_tls(size_t *);
+ hidden void __init_ssp(void *);
++hidden void __init_riscv_string_optimizations(void);
+ hidden void __libc_start_init(void);
+ hidden void __funcs_on_exit(void);
+ hidden void __funcs_on_quick_exit(void);
+diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
+new file mode 100644
+index 00000000..8568f32b
+--- /dev/null
++++ b/src/string/riscv64/memset.S
+@@ -0,0 +1,134 @@
++#ifdef __riscv_vector
++
++    .text
++    .global memset_vect
++/* void *memset_vect(void *s, int c, size_t n)
++ * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
++ * Returns a0.
++ */
++memset_vect:
++    mv      t0, a0                    /* running dst; keep a0 as return */
++    beqz    a2, .Ldone_vect           /* n == 0 then return */
++
++    li      t3, 8
++    bltu    a2, t3, .Lsmall_vect      /* small-size fast path */
++
++    /* Broadcast fill byte once. */
++    vsetvli t1, zero, e8, m8, ta, ma
++    vmv.v.x v0, a1
++
++.Lbulk_vect:
++    vsetvli t1, a2, e8, m8, ta, ma    /* t1 = vl (bytes) */
++    vse8.v  v0, (t0)
++    add     t0, t0, t1
++    sub     a2, a2, t1
++    bnez    a2, .Lbulk_vect
++    j       .Ldone_vect
++
++/* Small-size fast path (< 8).
++ * Head-tail fills to minimize branches and avoid vsetvli overhead.
++ */
++.Lsmall_vect:
++    /* Fill s[0], s[n-1] */
++    sb      a1, 0(t0)
++    add     t2, t0, a2
++    sb      a1, -1(t2)
++    li      t3, 2
++    bleu    a2, t3, .Ldone_vect
++
++    /* Fill s[1], s[2], s[n-2], s[n-3] */
++    sb      a1, 1(t0)
++    sb      a1, 2(t0)
++    sb      a1, -2(t2)
++    sb      a1, -3(t2)
++    li      t3, 6
++    bleu    a2, t3, .Ldone_vect
++
++    /* Fill s[3], s[n-4] */
++    sb      a1, 3(t0)
++    sb      a1, -4(t2)
++    /* fallthrough for n <= 8 */
++
++.Ldone_vect:
++    ret
++.size memset_vect, .-memset_vect
++
++    .text
++    .global memset_scalar
++memset_scalar:
++    mv      t0, a0                    
++    beqz    a2, .Ldone_scalar
++
++    andi    a1, a1, 0xff              
++
++    sb      a1, 0(t0)                 
++    add     t2, t0, a2
++    sb      a1, -1(t2)                
++    li      t3, 2
++    bleu    a2, t3, .Ldone_scalar
++
++    sb      a1, 1(t0)
++    sb      a1, 2(t0)
++    sb      a1, -2(t2)
++    sb      a1, -3(t2)
++    li      t3, 6
++    bleu    a2, t3, .Ldone_scalar
++
++    sb      a1, 3(t0)
++    sb      a1, -4(t2)
++    li      t3, 8
++    bleu    a2, t3, .Ldone_scalar
++
++    addi    t4, t0, 4
++    addi    t5, t2, -4
++.Lloop_scalar:
++    bgeu    t4, t5, .Ldone_scalar
++    sb      a1, 0(t4)
++    addi    t4, t4, 1
++    j       .Lloop_scalar
++
++.Ldone_scalar:
++    ret
++.size memset_scalar, .-memset_scalar
++
++#else
++
++    .text
++    .global memset_scalar
++memset_scalar:
++    mv      t0, a0                    
++    beqz    a2, .Ldone_scalar
++
++    andi    a1, a1, 0xff              
++
++    sb      a1, 0(t0)                 
++    add     t2, t0, a2
++    sb      a1, -1(t2)                
++    li      t3, 2
++    bleu    a2, t3, .Ldone_scalar
++
++    sb      a1, 1(t0)
++    sb      a1, 2(t0)
++    sb      a1, -2(t2)
++    sb      a1, -3(t2)
++    li      t3, 6
++    bleu    a2, t3, .Ldone_scalar
++
++    sb      a1, 3(t0)
++    sb      a1, -4(t2)
++    li      t3, 8
++    bleu    a2, t3, .Ldone_scalar
++
++    addi    t4, t0, 4
++    addi    t5, t2, -4
++.Lloop_scalar:
++    bgeu    t4, t5, .Ldone_scalar
++    sb      a1, 0(t4)
++    addi    t4, t4, 1
++    j       .Lloop_scalar
++
++.Ldone_scalar:
++    ret
++.size memset_scalar, .-memset_scalar
++
++#endif
+diff --git a/src/string/riscv64/memset_dispatch.c b/src/string/riscv64/memset_dispatch.c
+new file mode 100644
+index 00000000..c8648177
+--- /dev/null
++++ b/src/string/riscv64/memset_dispatch.c
+@@ -0,0 +1,32 @@
++#include "libc.h"
++#include <stddef.h>
++#include <stdint.h>
++#include <sys/auxv.h>
++
++void *memset_scalar(void *s, int c, size_t n);
++void *memset_vect(void *s, int c, size_t n);
++
++/* Use scalar implementation by default */
++__attribute__((visibility("hidden")))
++void *(*__memset_ptr)(void *, int, size_t) = memset_scalar;
++
++void *memset(void *s, int c, size_t n)
++{
++	return __memset_ptr(s, c, n);
++}
++
++static int __has_rvv_via_hwcap(void)
++{
++	const unsigned long V_bit = (1ul << ('V' - 'A'));
++	unsigned long hwcap = getauxval(AT_HWCAP);
++	return (hwcap & V_bit) != 0;
++}
++
++__attribute__((visibility("hidden")))
++void __init_riscv_string_optimizations(void)
++{
++	if (__has_rvv_via_hwcap())
++		__memset_ptr = memset_vect;
++	else
++		__memset_ptr = memset_scalar;
++}
+-- 
+2.39.5
+
diff --git a/arch/riscv64/arch.mak b/arch/riscv64/arch.mak
new file mode 100644
index 00000000..5978eb0a
--- /dev/null
+++ b/arch/riscv64/arch.mak
@@ -0,0 +1,12 @@
+# Disable tree vectorization for all files except memset.S
+
+# Reason: We have hand-optimized vector memset.S that uses runtime detection
+# to switch between scalar and vector implementations based on CPU capability.
+# However, GCC may auto-vectorize other functions (like memcpy, strcpy, etc.)
+# which would cause illegal instruction errors on CPUs without vector extensions.
+
+# Therefore, we disable auto-vectorization for all files except memset.S,
+# ensuring only our runtime-detected vector code uses vector instructions.
+
+# Add -fno-tree-vectorize to all object files except memset.S
+$(filter-out obj/src/string/riscv64/memset.o obj/src/string/riscv64/memset.lo, $(ALL_OBJS) $(LOBJS)): CFLAGS_ALL += -fno-tree-vectorize
diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
index c5b277bd..c23e63f4 100644
--- a/src/env/__libc_start_main.c
+++ b/src/env/__libc_start_main.c
@@ -38,6 +38,9 @@ void __init_libc(char **envp, char *pn)
 
 	__init_tls(aux);
 	__init_ssp((void *)aux[AT_RANDOM]);
+#ifdef __riscv
+	__init_riscv_string_optimizations();
+#endif
 
 	if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
 		&& !aux[AT_SECURE]) return;
diff --git a/src/internal/libc.h b/src/internal/libc.h
index 619bba86..28e893a1 100644
--- a/src/internal/libc.h
+++ b/src/internal/libc.h
@@ -40,6 +40,7 @@ extern hidden struct __libc __libc;
 hidden void __init_libc(char **, char *);
 hidden void __init_tls(size_t *);
 hidden void __init_ssp(void *);
+hidden void __init_riscv_string_optimizations(void);
 hidden void __libc_start_init(void);
 hidden void __funcs_on_exit(void);
 hidden void __funcs_on_quick_exit(void);
diff --git a/src/string/riscv64/memset.S b/src/string/riscv64/memset.S
new file mode 100644
index 00000000..8568f32b
--- /dev/null
+++ b/src/string/riscv64/memset.S
@@ -0,0 +1,134 @@
+#ifdef __riscv_vector
+
+    .text
+    .global memset_vect
+/* void *memset_vect(void *s, int c, size_t n)
+ * a0 = s (dest), a1 = c (fill byte), a2 = n (size)
+ * Returns a0.
+ */
+memset_vect:
+    mv      t0, a0                    /* running dst; keep a0 as return */
+    beqz    a2, .Ldone_vect           /* n == 0 then return */
+
+    li      t3, 8
+    bltu    a2, t3, .Lsmall_vect      /* small-size fast path */
+
+    /* Broadcast fill byte once. */
+    vsetvli t1, zero, e8, m8, ta, ma
+    vmv.v.x v0, a1
+
+.Lbulk_vect:
+    vsetvli t1, a2, e8, m8, ta, ma    /* t1 = vl (bytes) */
+    vse8.v  v0, (t0)
+    add     t0, t0, t1
+    sub     a2, a2, t1
+    bnez    a2, .Lbulk_vect
+    j       .Ldone_vect
+
+/* Small-size fast path (< 8).
+ * Head-tail fills to minimize branches and avoid vsetvli overhead.
+ */
+.Lsmall_vect:
+    /* Fill s[0], s[n-1] */
+    sb      a1, 0(t0)
+    add     t2, t0, a2
+    sb      a1, -1(t2)
+    li      t3, 2
+    bleu    a2, t3, .Ldone_vect
+
+    /* Fill s[1], s[2], s[n-2], s[n-3] */
+    sb      a1, 1(t0)
+    sb      a1, 2(t0)
+    sb      a1, -2(t2)
+    sb      a1, -3(t2)
+    li      t3, 6
+    bleu    a2, t3, .Ldone_vect
+
+    /* Fill s[3], s[n-4] */
+    sb      a1, 3(t0)
+    sb      a1, -4(t2)
+    /* fallthrough for n <= 8 */
+
+.Ldone_vect:
+    ret
+.size memset_vect, .-memset_vect
+
+    .text
+    .global memset_scalar
+memset_scalar:
+    mv      t0, a0                    
+    beqz    a2, .Ldone_scalar
+
+    andi    a1, a1, 0xff              
+
+    sb      a1, 0(t0)                 
+    add     t2, t0, a2
+    sb      a1, -1(t2)                
+    li      t3, 2
+    bleu    a2, t3, .Ldone_scalar
+
+    sb      a1, 1(t0)
+    sb      a1, 2(t0)
+    sb      a1, -2(t2)
+    sb      a1, -3(t2)
+    li      t3, 6
+    bleu    a2, t3, .Ldone_scalar
+
+    sb      a1, 3(t0)
+    sb      a1, -4(t2)
+    li      t3, 8
+    bleu    a2, t3, .Ldone_scalar
+
+    addi    t4, t0, 4
+    addi    t5, t2, -4
+.Lloop_scalar:
+    bgeu    t4, t5, .Ldone_scalar
+    sb      a1, 0(t4)
+    addi    t4, t4, 1
+    j       .Lloop_scalar
+
+.Ldone_scalar:
+    ret
+.size memset_scalar, .-memset_scalar
+
+#else
+
+    .text
+    .global memset_scalar
+memset_scalar:
+    mv      t0, a0                    
+    beqz    a2, .Ldone_scalar
+
+    andi    a1, a1, 0xff              
+
+    sb      a1, 0(t0)                 
+    add     t2, t0, a2
+    sb      a1, -1(t2)                
+    li      t3, 2
+    bleu    a2, t3, .Ldone_scalar
+
+    sb      a1, 1(t0)
+    sb      a1, 2(t0)
+    sb      a1, -2(t2)
+    sb      a1, -3(t2)
+    li      t3, 6
+    bleu    a2, t3, .Ldone_scalar
+
+    sb      a1, 3(t0)
+    sb      a1, -4(t2)
+    li      t3, 8
+    bleu    a2, t3, .Ldone_scalar
+
+    addi    t4, t0, 4
+    addi    t5, t2, -4
+.Lloop_scalar:
+    bgeu    t4, t5, .Ldone_scalar
+    sb      a1, 0(t4)
+    addi    t4, t4, 1
+    j       .Lloop_scalar
+
+.Ldone_scalar:
+    ret
+.size memset_scalar, .-memset_scalar
+
+#endif
diff --git a/src/string/riscv64/memset_dispatch.c b/src/string/riscv64/memset_dispatch.c
new file mode 100644
index 00000000..aadf19fb
--- /dev/null
+++ b/src/string/riscv64/memset_dispatch.c
@@ -0,0 +1,38 @@
+#include "libc.h"
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/auxv.h>
+
+void *memset_scalar(void *s, int c, size_t n);
+#ifdef __riscv_vector
+void *memset_vect(void *s, int c, size_t n);
+#endif
+
+/* Use scalar implementation by default */
+__attribute__((visibility("hidden")))
+void *(*__memset_ptr)(void *, int, size_t) = memset_scalar;
+
+void *memset(void *s, int c, size_t n)
+{
+	return __memset_ptr(s, c, n);
+}
+
+static int __has_rvv_via_hwcap(void)
+{
+	const unsigned long V_bit = (1ul << ('V' - 'A'));
+	unsigned long hwcap = getauxval(AT_HWCAP);
+	return (hwcap & V_bit) != 0;
+}
+
+__attribute__((visibility("hidden")))
+void __init_riscv_string_optimizations(void)
+{
+#ifdef __riscv_vector
+	if (__has_rvv_via_hwcap())
+		__memset_ptr = memset_vect;
+	else
+		__memset_ptr = memset_scalar;
+#else
+	__memset_ptr = memset_scalar;
+#endif
+}
diff --git a/test_memset b/test_memset
new file mode 100755
index 0000000000000000000000000000000000000000..5fe4bef7c7b1ebaf407fbc502df4f3ea312b0150
GIT binary patch
literal 8824
zcmeHNZ)_Y_5ud%YV>eA*J8qJuLG%&_FalrBj$_Q9ras$e$ExE5C;5N?J+JT9=Tq;`
zc5j_JY1P`77`I47t}IA|_>ksHBluEPm1qGaO{=I05-5BD3FSi&MXf}^E`fr8*v!10
zIp6KsB2+>G$xrgmo8Qd5e>?Bpn^`{_8|(`h21p6OZ6I#0%EP)?@...ar5Z^Wbizh_
zJ_Org1M*cG^W<IHnshw-NWjB|)_N=ik&>Oway#JZ3*E_jLOm&volSPf>RG`@e(K}J
zjwp?NrIbo=zBCamLV?>SEfR(5NQ(61o}JKI9)CSku_Ko3I@...?S#I}-3yg<p}FaG
z;qllTJVl`)lv4@mr>9HR!@...WOtJ%d9&KH6S|xAg_16nFQXlI<pKZZDa$_0{jIew
zw|IU+nT0*+)aagcq9vWmPE58qseJr+M|(@!?r2Yh^SN-VbWi?4<Dzw>DG@...!kGn
zOLD5ypEn=$>cfWiY?4tM^-q28>DJc@...O+$G2>MX5;Lhp+`PMc-68K(arcb!uvp!
z8X32J_(O~z_Tgc~>*0gtP_vBGYl%2&f6Av1O5qB{Ro*onS7ppp7BW^QZ@...sjNE&
z_Jo@...-dy<zzZHYNnN)L!ss1+YNdB_}8qq1TP&)eF-iuOe1`-1ZS5b7|u2U><zvZ
z4PFZVq%qiYXTE^+NF$tjr2t8&gKaqB;QqQFCxdlBYvu80epJra`S>%hi+lmP0^gtA
zhFvyRUo{F=1R9?(mZ$rS{*BKqEUG(!xtWV+cLK~$U+E}J8BmxSu&yT?tE&U^vo(SE
zH(rAI8h{txD$JHhW@-Y@...3%`?~JDaOvAKGtXaL(Yc-szC9Z-7NQLs-dR|D_8%8U
zs{5}en@...h#G~d!qmj&M#H$&^reQbrq1cE#ww$GzUK6$#tqOlUtM3de`aCv;oD8^
zRZCkgp8DMBr7PPmoIUmS?B)Hl^ELHV@...Yr}iP5Y@...Dz^T|nf@...pGg6sr}DO
zL9ET!EI#_%+wcC9bY{P{u=wa7ru+BLoQh0GgY8BX`|d7G(YhOYCji78C+CDrH)QAX
zDL0kNhCXp}Vk5AVgj_3+A~T=25<9`z5&U8ZZOQ*%U0Po5MEVZWL8N`l%gYI*vY&$|
zhrxK-Fm`MTXzW3hxv0AfDbyJ70Fuw*-yCA4e$tFsFaG@+@...ltvg?PU^?(zRqv)+
z2#jCpy|yWEzP6W-L4CGSpX7u6L&wNMWB2RY`eS_s)>mMC1=d$!eFfH6;D2ESUgi5~
zh~o)hDtE|F@*qP(InKc=nl2L8Y-hfi<s43e$m#w}CBpKLd2CTGcm8u6C;nw=IY%nj
zn3p^1T+kbt7XzRI^LRzlrHkpkOzECOWs66OPr5TvS$x=QPwqS;8U=kHqAFfCb=k)B
z2KOhp@...4*1LzP^v~>njY!;y*FIgOqX@s>wW#Rz_w;-r)QmT_PlsCfg<HcB@...d
z=-x<2WbfzAczaP4y^dx&wgclbR;%i(9SrkQZFZ~5wAm45Qf;iwZc_<*#0fjzijR}2
zc-|rIWNOUH#w|3#dL}cG)>6~V<h)AS$;WjqZh9o4$#H9M+gPzS$u$fvi#3sUw6iqb
z+SWPhiPL-Aq}I_+HJSbDG>ErY$ybzW1o5seep-$Pi~d@...$64l;hPP{#=f40P(AG
zyawbwmgBX>d`>xj4_uUeT_Ol#c1Q9CB{0SjxNA5cQI2l{$xoEy_Z9Oc<#-*)IZ%$@
zU)=xYcs+>!mg5cZ9`Da`d~<Q%mg9{O3dzZ}A_?MdZeA7lp9-E$5Vo!ubB$W9!Z_bA
z@...-`oF0aYeCLevGej*weX5BFYwDX+W!sW2CS@...UI-8v1`)1DA2W;p>+YD67qD
zlU~MJc+;mJM?8S1wjZ|`zv<JTWL$n%c=&Jzaf&+Sy&$X0cQsz=Cs(+g;_npe|4i#w
z`rG@...sp0BMvzJUA}#>);fG0e;5qh7wJ{@#cJ(8uJtSZ+0}UE{yl>@tg;_p)cTeF
z|3kznujJ1|T|#`dJk$-1SMG~jh&Kc(@%OoXe_rf^HT0KAKTr$eSIFw}B>uo|#&w=!
zf7r$NR>bS^y+?feaTnv_53CMh-3}#%^@...et|V=U*2PZC3(~jVP?vI4(!wISK_^l
z`}=)N<CWt)jCcdAoU3wgQHZaW?|XvV_m2m30x+C|-IM>PKb`5EnNUd^bCgdyrtPIN
z&Cyg#Hs@...Ks#%*Gxhxn{pAr3A?Fyi({p8ab)9;n|CM1#;|%fOR#crxZBGDBrL~D
zrt+9JRc=Pb)48mb$6+OMl2lE&IVZ2oiAjj(GIrW>tpwhCR@...I16Puj(I{^S=TuM
zV~&}zRAM5NIe{i6oTB-bsZ!ETd0F27c(xzqnuO!&8;%}~sn}4jQlNUjG88>H&;#o5
z@...q@3G;S8XOpUEZQGaN2A??F;r1_^HZ?{Y<Hk{7*zk@...CQI@...Bz9C3IZ0ia
zmsaXXZ}ezXb;tS#hNx3@...GUK&P~ES@...Ze|l{3sfSlv?G|VMX7Lf3}vSfSKUXB
z$gtIDKJVqpB}J}Oc^_87bWOk)k^nVu2z3*wteU{wC^qEJvSa_K%<Vxqe<I_WqexxH
zqvKN3iFO;p$?QaU+{}+dIB_D2>K=6+2;<1ju$^`x>{^rfbaHweVQZZIU_60Z!a-Q?
zh-mjf3+;xm?N}=w0^!p7ij(Qi%uFhd<H%tb`0#qndtKquKp4F?gU*NppwBE}({-KH
z=mdIO@...N6^h^J#Fg`0?!odsZ>O*6^3xgMQz)mHP3+~qe2xLiHB*uEQ{pG#n^5-Q
z@^Vh>B@...0Oh)=h`!i+C_oEVC<YY!;{6-XV=8kV(PN~si@...`w$~lv6p-L9NTxH
zthD&&{|E|cY~8FR_xT%aKf*Za(C>!7f3l-CCC-(5e}pGIg0j*gd48e&C@...-{Cp7
zf510jhKo%eWP6#v#1R)c5N|JPdKDSJz?Au?F2r8q4f%e#TNIi4`;YtVi*X4ri0!lY
z_n$#I#pu$%#4k46hkW+_{+~iQttG#%Vw@x2L;d{;FZ7#eOW#DqUgDu<{!k)%sG^JT
zLcfD3eKQez$(uy((*6S4(p1G>;w$-HE#IXjuOj`5|NH=L>HUel#9{C8haAyI6<vfE
z`ZA)W_L65=Wcvu)2@...ulwxf{uJT?ik{zI_&2aZIx>E_-!-pc@...=6>SOgk5laF
zd?_trFZAaq#J=rP9jeR!HF$C0id5_bAzpk}FruLbkDFvart_)||71Pn{GxbI=I&W|
e&vTr}dd4U8_fOsURCEJdIk9)kYp(e8+y4s;WmPW#

literal 0
HcmV?d00001

-- 
2.39.5

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.