Message-ID: <20250710154414.GU1827@brightrain.aerifal.cx>
Date: Thu, 10 Jul 2025 11:44:15 -0400
From: Rich Felker <dalias@...c.org>
To: Stephen Von Takach <steve@...ce.technology>
Cc: musl@...ts.openwall.com, Viv Briffa <viv@...ce.technology>
Subject: Re: unlink on NFS volume fails silently

On Thu, Jul 10, 2025 at 02:58:30PM +1000, Stephen Von Takach wrote:
> Yeah I see your point and this was closed as a kernel issue:
> https://gitlab.alpinelinux.org/alpine/aports/-/issues/10960

OK, is your issue unlink falsely succeeding, or readdir skipping
entries? The latter is a known bug in the kernel NFS client. One of my
comments on the tracker suggests:

  "The nordirplus option mentioned in one of those tracker threads
  might be a workaround."

I'm not sure if this is the case, but it might be worth trying.

Note that it's *expected* that an already-in-progress iteration of a
directory may return entries that were already deleted. The
unacceptable thing is the opposite: skipping entries that have not
been deleted, as a consequence of other entries being deleted.

> We're running these two containers on the same kernel and seeing the same
> behaviour as that alpine issue.
> Happy to continue working around the issue by using debian userspace to
> build our service.
> 
> It does seem crazy that there is clearly an issue, possibly a kernel issue
> that is being handwaved away by all parties

It's not "handwaved away" by us. We have determined that there is a
bug in a component we have no control over, and which we have no
sound means of working around.

I'm happy to work together on tracking down the cause to get it fixed,
but that requires cooperation from someone who's able to reproduce it,
documenting the exact circumstances under which it occurs (NFS server
vendor/version, NFS mount options) and either producing a minimal test
program to reproduce the issue under those conditions, or being
willing to run a proposed test by someone else.
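
As a rough starting point, a test along these lines might be enough to
tell the two symptoms apart. This is only a sketch under assumptions:
the function names, the filename pattern, and the file count are
arbitrary choices for illustration, not taken from the tracker, and it
has not been run against an affected NFS mount. It creates files in a
directory, unlinks each entry as readdir returns it, then rescans and
counts what is left. unlink failures are reported on stderr, so a
nonzero count with no errors would point at readdir skipping entries
rather than at unlink.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count entries other than "." and ".."; returns -1 if opendir fails. */
static int count_entries(const char *dir)
{
    DIR *d = opendir(dir);
    if (!d) return -1;
    int n = 0;
    struct dirent *de;
    while ((de = readdir(d)))
        if (strcmp(de->d_name, ".") && strcmp(de->d_name, ".."))
            n++;
    closedir(d);
    return n;
}

/* Create nfiles empty files in dir, unlink every entry a single
 * readdir pass returns, then rescan. Returns the number of entries
 * still present (0 if nothing was skipped), or -1 on setup failure. */
static int leftover_after_unlink_pass(const char *dir, int nfiles)
{
    char path[4096];

    if (mkdir(dir, 0700) && errno != EEXIST) return -1;
    for (int i = 0; i < nfiles; i++) {
        /* Name length and count are arbitrary; vary them when testing. */
        snprintf(path, sizeof path, "%s/file-with-a-longish-name-%06d", dir, i);
        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) return -1;
        close(fd);
    }

    DIR *d = opendir(dir);
    if (!d) return -1;
    struct dirent *de;
    while ((de = readdir(d))) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        snprintf(path, sizeof path, "%s/%s", dir, de->d_name);
        /* ENOENT is tolerated: readdir may return already-deleted entries. */
        if (unlink(path) && errno != ENOENT)
            fprintf(stderr, "unlink %s: %s\n", path, strerror(errno));
    }
    closedir(d);

    return count_entries(dir);
}
```

Calling leftover_after_unlink_pass() from a small main() with a path
on the NFS mount and a few thousand files, under both musl and glibc
and with and without nordirplus, should show whether anything gets
skipped. Varying the name length and count matters, since those
seemed to determine whether the bug triggered in the earlier report.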

Even if using Debian/glibc *seems* to make things work for you, I
think it would be beneficial for you to try to get to the root cause
of the problem and get it fixed. What we previously found on the
above-linked ticket was that glibc is not doing anything special that
should rule out that bug, only that the particular filename
sizes/counts in the test didn't trigger the bug with glibc.

Again, I don't know if this is the same bug you're hitting (this is
the first time in the thread you've mentioned readdir if I'm not
mistaken, as opposed to just unlink) or if there's a second bug in
play here. If you could at least clarify that, it would be a big help
to anyone investigating it in the future.

Rich
