Date: Sun, 11 Jun 2017 14:57:59 -0600
From: Benjamin Slade <slade@...nam.net>
To: Joakim Sindholt <opensource@...sha.com>
Cc: musl@...ts.openwall.com
Subject: Re: ENOSYS/EOPNOTSUPP fallback?

Thank you for the extensive reply. Just to be clear: I'm just an
end-user of flatpak, &c. As far as I can tell, flatpak is making use of
`ostree`, which assumes that the libc will take care of handling the
`dd`-style fallback (I got the impression that flatpak isn't calling
`fallocate` directly itself). Do you think there's an obvious avenue for
following up on this? Admittedly this is an edge case that won't
necessarily affect musl users on ext4, but it will affect musl users on
ZFS (and, I believe, F2FS). Do you think `ostree` shouldn't rely on the
libc for the fallback? Or should ZFS on Linux implement a fallback for
fallocate?

-- 
Benjamin Slade
 `(pgp_fp: ,(21BA 2AE1 28F6 DF36 110A 0E9C A320 BBE8 2B52 EE19))
 '(sent by mu4e on Emacs running under GNU/Linux . https://gnu.org )
 '(Choose Linux, Choose Freedom . https://linux.com )

On 2017-06-05T06:46:33-0600, Joakim Sindholt <opensource@...sha.com> wrote:

> On Sun, Jun 04, 2017 at 09:22:27PM -0600, Benjamin Slade wrote:
> > I ran into what is perhaps a weird edge case. I'm running a system
> > with musl that uses a ZFS root fs. When I was trying to install some
> > flatpaks, I got a `fallocate` failure, with no `dd` fallback.
> > Querying the flatpak team, the fallback to `dd` seems to be something
> > which glibc does (and which the other components assume will be taken
> > care of).
> >
> > Here is the exchange regarding this issue:
> > https://github.com/flatpak/flatpak/issues/802
>
> To quote the glibc source file linked in the bug:
>
>   /* Minimize data transfer for network file systems, by issuing
>      single-byte write requests spaced by the file system block size.
>      (Most local file systems have fallocate support, so this fallback
>      code is not used there.)
>   */
>
>   /* NFS clients do not propagate the block size of the underlying
>      storage and may report a much larger value which would still
>      leave holes after the loop below, so we cap the increment at
>      4096. */
>
>   /* Write a null byte to every block. This is racy; we currently
>      lack a better option. Compare-and-swap against a file mapping
>      might address local races, but requires interposition of a signal
>      handler to catch SIGBUS. */
>
> Which leaves two massive bugs:
>
> 1) leaving unallocated gaps, both because of the NFS issue and because
>    other file systems may work on entirely different principles that
>    are not accounted for here, and
> 2) overwriting data currently being written to the file as it is
>    being forcibly allocated (and the forced allocation itself might
>    accomplish nothing; think deduplication).
>
> This is not a viable general solution, and furthermore fallocate is
> mostly just an optimization hint. If it's a hard requirement of your
> software, I would suggest implementing it in your file system. These
> operations can only be safely implemented in the kernel.
>
> An example: MyFS uses write-time deduplication on unused blocks (and
> blocks of all zeroes fall under the umbrella of unused). Glibc starts
> its dance where it writes a zero byte to the beginning of each block it
> perceives, and for now let's just say it has the right block size. MyFS
> just trashes these writes immediately without touching the disk and
> updates the size metadata, which gets lazily written at some point.
> There's only 400k left on the disk, and your fallocate of 16G will
> succeed and run exceptionally fast to boot, but it will have allocated
> nothing, and your next write fails with ENOSPC.
>
> Another example: myutil has two threads running. One thread is
> constantly writing things to a file.
> The other thread sometimes writes large chunks of data to the file,
> and so it hints the kernel to allocate these large chunks by calling
> fallocate, only then taking the lock(s) held internally to synchronize
> the threads. The first thread finds it needs to update something in
> the section currently being fallocated by glibc's algorithm. Suddenly
> zero bytes appear at 4k intervals for no discernible reason,
> overwriting the data.
>
> Personally, I would look into seeing to it that flatpak only uses
> fallocate as an optimization. The most reliable thing I can think of
> otherwise would be to do the locking necessary (if any) in the program
> and fill the entire target section of the file with data from
> /dev/urandom, but even that may fail spectacularly with transparent
> compression (albeit unlikely).
>
> Hope this was at least somewhat helpful.