libc-coord - Re: stdio_ext.h extensions for gnulib

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJgzZoqO+_nL5XjESpB-iaMbBWswu59+VrN2qLnmysboGjkJ_A@mail.gmail.com>
Date: Fri, 15 Apr 2022 17:39:13 -0700
From: enh <enh@...gle.com>
To: libc-coord@...ts.openwall.com
Subject: Re: stdio_ext.h extensions for gnulib

On Tue, Apr 12, 2022 at 9:00 PM Rich Felker <dalias@...c.org> wrote:

> Early on in musl's history, we added a set of further extensions in
> stdio_ext.h:
>
> size_t __freadahead(FILE *);
> const char *__freadptr(FILE *, size_t *);
> void __freadptrinc(FILE *, size_t);
> void __fseterr(FILE *);
>
> The purpose of these functions was to provide a way for gnulib to do
> the things they already insisted on doing, but without having access
> to the FILE representation internals (which is how they implemented
> these things for every other system at the time).
>
> The topic recently came up again via Toybox, where the author is not
> using gnulib but was looking for some of the same functionality. I'm
> told Bionic is on-board with adding these,


apparently i shipped __fseterr() years ago (despite the fact that if you'd
asked me last week, i'd have said "we only have the stuff glibc *and* musl
both have"), and this week i was more convinced by __freadahead() than the
other two.

i was a bit concerned about how well the other two map to all extant stdio
implementations, but it turns out i have a bit of a problem implementing
__freadahead() too [and an existing bug or two i didn't previously know
about]. it looks like musl is quite strict with ungetc() --- there's a
fixed-size always-allocated unget buffer that's just before the actual
buffer and only used if you try to ungetc at the start of the file? the BSD
implementation in bionic instead has an "out of line" unget buffer that
will grow arbitrarily large. which is problematic for messing with the read
pointer unless i've misunderstood what those two functions actually do? i
didn't find any documentation.

for me it's a bit problematic for __freadahead() too because it turns out
that POSIX's "The pushed-back bytes shall be returned by subsequent reads
on that stream in the reverse order of their pushing" isn't exactly true
for bionic. i think a subsequent getc() will return these characters, but i
don't think a subsequent fread() would, for example. i've not seen any bug
reports around this, so i guess the default "3 bytes is enough for anyone"
buffer is actually enough in practice (or people who ungetc() only read via
getc()).

so i think my choices are either:

1. fix the arbitrary unget buffer, and have __freadahead() count characters
in that, and give up on functions that touch the read pointer.
2. say "well, looks like no-one's using unlimited unget anyway" and use the
musl "8 bytes is enough for anyone ... which means we can just have a
slightly larger single buffer" trick, which also lets us implement the read
pointer functions.

i haven't had much time to look at this, so i'm also not sure how important
the code to switch to a separate buffer if you ungetc() a _different_ byte
to the one that was just returned. there's a comment there about sscanf()
on const data (
https://android.googlesource.com/platform/bionic/+/master/libc/upstream-openbsd/lib/libc/stdio/ungetc.c#124
).

fwiw, i did run a quick test program on current macOS and glibc 2.33 and
they both appear to allow unlimited amounts of unget.


> and it might make sense to
> coordinate this with glibc and other libc implementations/other
> systems, or even propose these as some sort of standard. In the long
> term, having these would allow systems to decouple gnulib from their
> stdio internals, and eventually maybe even have the freedom to change
> stdio internals if it ever makes sense to do so.
>

normally the demands of app compat would mean i probably wouldn't be able
to change how the unget buffer works (just as much of the rest of stdio is
effectively frozen), but given that the musl implementation lets you
effectively pretend there's no such thing, and all reads are normal reads,
i might be able to get away with that. (although it seems like sscanf()
would have to change to _copy_, and i'm stuck if apps are actually relying
on being able to unget arbitrary amounts of data, so i think my option #1
above is still the less implausible.)


> Rich
>

Content of type "text/html" skipped
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.