Discussion:
[musl] stdio glitch & questions
Xan Phung
2018-11-30 10:51:39 UTC
Permalink
Hi,

A few questions about stdio:

(1) I notice __toread.c uses angular quotes for <stdio_impl.h> whereas all
other source files use "stdio_impl.h". I assume the latter is correct and
__toread.c's use of angular quotes was a glitch & it should really be
double quotes... is that correct?

(2) I notice vfprintf first tries to call printf_core with f=0 (line 667)
then calls printf_core again with f set to the actual file to receive
output (line 682). Why is printf_core called twice? I struggle to
understand the purpose of the first call with f=0.

(3) When I do a step thru the __fwritex function to understand how printf
works, I notice the resulting writev system calls pass on the output data
as a two element iovec array, with the 1st element comprising all line
buffered text up to & including the last variable data item, and then the
2nd element comprising the residual format string trailing the last
variable data item (more often than not just a single '\n').

For example, printf("error: %s\n", msg) would put all text up to &
including %s text in first iovec and the second iovec contains only '\n'.
I understand the rationale of this is to avoid copying the final '\n' to
the buffer at f->wpos. (There is actually guaranteed space in the buffer
itself due to a check at line 10 of fwrite.c). The use the array of 2x
iovec's presumably then relies on Linux kernel scatter-gather I/O to then
optimally handle the iovec array, ie: that the writev() of 2x iovec is more
efficient than avoiding the copy of a few additional bytes (often a single
'\n' byte) into f->wpos, and then using a single write() syscall.

Isn't this a big assumption? With Linux itself, can we really know that
Linux device drivers are smart enough to do writev() optimally? Also,
there is a lot of interest in porting musl to non-Linux os's, many of which
do not have writev(). (I am porting musl to WebAssembly and to Plan 9).

I can prepare a patch of a version using write() instead of writev() if
there is interest in this...

regards
Xan Phung
Rich Felker
2018-11-30 16:09:51 UTC
Permalink
Post by Xan Phung
Hi,
(1) I notice __toread.c uses angular quotes for <stdio_impl.h> whereas all
other source files use "stdio_impl.h". I assume the latter is correct and
__toread.c's use of angular quotes was a glitch & it should really be
double quotes... is that correct?
Yes, this doesn't make any difference but it's a style mistake.
Post by Xan Phung
(2) I notice vfprintf first tries to call printf_core with f=0 (line 667)
then calls printf_core again with f set to the actual file to receive
output (line 682). Why is printf_core called twice? I struggle to
understand the purpose of the first call with f=0.
To understand this you need to look inside printf_core. When called
with !f, it attempts to collect the %N$-form arguments if they're
used, or bails out early if it detects that normal % arguments are
used. Two passes are needed here because random access to a va_list is
not possible.
Post by Xan Phung
(3) When I do a step thru the __fwritex function to understand how printf
works, I notice the resulting writev system calls pass on the output data
as a two element iovec array, with the 1st element comprising all line
buffered text up to & including the last variable data item, and then the
2nd element comprising the residual format string trailing the last
variable data item (more often than not just a single '\n').
For example, printf("error: %s\n", msg) would put all text up to &
including %s text in first iovec and the second iovec contains only '\n'.
I understand the rationale of this is to avoid copying the final '\n' to
the buffer at f->wpos. (There is actually guaranteed space in the buffer
itself due to a check at line 10 of fwrite.c). The use the array of 2x
iovec's presumably then relies on Linux kernel scatter-gather I/O to then
optimally handle the iovec array, ie: that the writev() of 2x iovec is more
efficient than avoiding the copy of a few additional bytes (often a single
'\n' byte) into f->wpos, and then using a single write() syscall.
Indeed, in the case where the new data is very short, it's almost
certainly faster to just copy it to the buffer and perform a single
write syscall. Likewise, for reading a single character it's almost
surely faster to perform a single read syscall then pull it out of the
buffer.

However, conversely, it's possible to see a call to the stdio write
backend (f->write) where the new data is too large to fit in the
buffer. In this case, a writev syscall is almost certainly faster
(fewer trips back and forth between user and kernel space, which are
the dominant cost), and moving data into the buffer is not helpful
because it can't reduce the number of syscalls. Prior to commit
e3cd6c5c265cd481db6e0c5b529855d99f0bda30, fwrite contained heuristic
logic for individual cases, but it couldn't necessarily be optimal
under all usage patterns. After the change, the number of syscalls is
always minimized.
Post by Xan Phung
Isn't this a big assumption? With Linux itself, can we really know that
Linux device drivers are smart enough to do writev() optimally? Also,
there is a lot of interest in porting musl to non-Linux os's, many of which
do not have writev(). (I am porting musl to WebAssembly and to Plan 9).
I can prepare a patch of a version using write() instead of writev() if
there is interest in this...
You can emulate readv and writev using the property that short reads
and writes are permissible, copying data through a fixed-size
intermediate buffer on the stack. This is of course suboptimal but
easy to do.

Emulation of readv is really sensitive, because breaking it up into
multiple reads can cause inapprpriate blocking. Linux actually has a
bug where this can happen anyway -- see commit
2cff36a84f268c09f4c9dc5a1340652c8e298dc0 -- so musl's __stdio_read
already reads the last character of the first (caller-requested) part
through the buffer, and collapses the readv to a read if this makes
the first iov empty.

It would probably be welcome to make __stdio_write make use of
SYS_write when it would be expected to be faster (len very small), but
I'm not sure what the exact cutoff should be. Switching away from
writev/readv would not be a welcome change though; use of them is very
intentional and it's how musl avoids some pathological slowness under
certain stdio usage patterns.

If you're porting to a system that lacks the underlying syscalls, I
think it probably makes sense to emulate them at the syscall() level
using a strategy like I described above. It's necessary for making the
public readv()/writev() functions work anyway.

Rich
Xan Phung
2018-11-30 22:15:56 UTC
Permalink
Thanks for the quick answer, and I've taken a look at the pre-2011 fwrite.c
code, and using SYS_writev is indeed much cleaner code!

See below on my proposed answer to your question about what the "cutoff"
should be for copying. (SYS_writev is fully retained, but the 2nd iovec
element will very often be zero length in this proposal, which makes
emulation of SYS_writev much more efficient).
Post by Rich Felker
It would probably be welcome to make __stdio_write make use of
SYS_write when it would be expected to be faster (len very small), but
I'm not sure what the exact cutoff should be.
My proposal is the cutoff be 5-8 bytes (on 32 bit CPUs) , and on 64 bit
CPUs, 9-16 bytes.

The cutoffs are selected in such a way that the "no copy" loop (searching
for '\n') always ends on a word aligned position (opening the door to
future optimisations by using word-at-a-time search for '\n' instead of
byte-at-a-time). The "copy" branch is also guaranteed to only be a double
word at most, but a minimum of a single word (allowing a two word memcpy to
be done with just a 2x load/mask/store word code sequence). Some example
code is shown to give a general idea of word-aligning the cutoff amount
(but not yet doing word-at-a-time searching of '\n', or optimised two word
memcpy).

*CURRENT __fwritex.c CODE (lines 12-20)*:


if (f->lbf >= 0) {

/* Match /^(.*\n|)/ */

for (i=l; i && s[i-1] != '\n'; i--);

if (i) {

size_t n = f->write(f, s, i);

if (n < i) return n;

s += i;

l -= i;

}

}



*PROPOSED*:

size_t i, len;
if (f->lbf >= 0) {
const unsigned char *t = ALIGN(s+sizeof(size_t)*2);
for (i = l+s-t; ; i--) {
if (i <= 0) { /* SHORT LINE - copy up to 16 bytes into f->wpos
buffer and then flush line */
for (j = t-s; j && s[j-1] != '\n'; j--);
if (j) {
memcpy(f->wpos, s, j); f->wpos += j;
size_t n = f->write(f, t, 0);
if (n < 0) return n;
s += j;
l -= j;
} break;
}
if (t[i-1] == '\n') {
size_t n = f->write(f, s, len = i+t-s);
if (n < len) return n;
s += len;
l -= len;
break;
}
}
}
Rich Felker
2018-12-01 00:02:29 UTC
Permalink
Post by Xan Phung
Thanks for the quick answer, and I've taken a look at the pre-2011 fwrite.c
code, and using SYS_writev is indeed much cleaner code!
See below on my proposed answer to your question about what the "cutoff"
should be for copying. (SYS_writev is fully retained, but the 2nd iovec
element will very often be zero length in this proposal, which makes
emulation of SYS_writev much more efficient).
Post by Rich Felker
It would probably be welcome to make __stdio_write make use of
SYS_write when it would be expected to be faster (len very small), but
I'm not sure what the exact cutoff should be.
My proposal is the cutoff be 5-8 bytes (on 32 bit CPUs) , and on 64 bit
CPUs, 9-16 bytes.
The cutoffs are selected in such a way that the "no copy" loop (searching
for '\n') always ends on a word aligned position (opening the door to
future optimisations by using word-at-a-time search for '\n' instead of
byte-at-a-time). The "copy" branch is also guaranteed to only be a double
word at most, but a minimum of a single word (allowing a two word memcpy to
be done with just a 2x load/mask/store word code sequence). Some example
code is shown to give a general idea of word-aligning the cutoff amount
(but not yet doing word-at-a-time searching of '\n', or optimised two word
memcpy).
if (f->lbf >= 0) {
/* Match /^(.*\n|)/ */
for (i=l; i && s[i-1] != '\n'; i--);
if (i) {
size_t n = f->write(f, s, i);
if (n < i) return n;
s += i;
l -= i;
}
}
size_t i, len;
if (f->lbf >= 0) {
const unsigned char *t = ALIGN(s+sizeof(size_t)*2);
for (i = l+s-t; ; i--) {
if (i <= 0) { /* SHORT LINE - copy up to 16 bytes into f->wpos
buffer and then flush line */
for (j = t-s; j && s[j-1] != '\n'; j--);
if (j) {
memcpy(f->wpos, s, j); f->wpos += j;
size_t n = f->write(f, t, 0);
if (n < 0) return n;
s += j;
l -= j;
} break;
}
if (t[i-1] == '\n') {
size_t n = f->write(f, s, len = i+t-s);
if (n < len) return n;
s += len;
l -= len;
break;
}
}
}
I've been trying to understand what you're trying to do. It seems you
chose to work at the point of line-buffered flush logic, since that
happens to be the only case where f->write is called with an argument
that might fit in the remaining buffer space.

As written the alignment logic and pointer arithmetic is invalid; the
sums/differences are out of bounds of the array, and i<=0 is not
meaningful since i has an unsigned type (and so does l+s-t). But even
if it could be made correct, it's all completely unnecessary and just
making the code slower and less readable.

If __fwritex were the right place for this code, all you would need to
do is check whether i<16 (or whatever threshold) before calling
f->write, and if so, memcpy'ing it to the buffer then calling f->write
with a length of 0. However, then you could not use the return value
of f->write to determine if it succeeded (see how fflush and fseek
have to deal with this case). Contrary to what your code assumes,
f->write does not (and cannot, since the type is unsigned) return a
negative value on error.

Instead, I think it probably makes more sense to put the logic in
__stdio_write, but this will also be somewhat nontrivial to work in.
At least the "iovcnt == 2 ? ..." logic needs to be adapted to
something like "rem > len ? ...". Before the loop should probably be
something like "if (len < f->wend-f->wpos && len <= 16) ..." to
conditionally copy the new data into the buffer.

Do you see any reason to prefer doing it in __fwritex?

Rich
Xan Phung
2018-12-01 02:42:30 UTC
Permalink
This post might be inappropriate. Click to display it.
Rich Felker
2018-12-01 03:17:36 UTC
Permalink
Post by Xan Phung
Post by Rich Felker
I've been trying to understand what you're trying to do. It seems you
chose to work at the point of line-buffered flush logic, since that
happens to be the only case where f->write is called with an argument
that might fit in the remaining buffer space.
Yes that's correct... Also, the other reason I chose __fwrite.c is it's the
only place where the search for '\n' is done.
An additional objective I had was to split the loop searching for '\n' into
(i) Stage 1: search for '\n' word by word ie: 8 bytes at a time ... if '\n'
found at >~16 bytes before start of "s" array then go straight to f->write
without copying bytes
(I don't show it in my code, but the algorithm to do stage 1 would be
similar to memchr.c)
(ii) Stage 2: in final 9~16 bytes, drop down into byte-by-byte searching
for '\n' (also doing copy of residual bytes into buffer when '\n' found)
Post by Rich Felker
As written the alignment logic and pointer arithmetic is invalid; the
sums/differences are out of bounds of the array, and i<=0 is not
meaningful since i has an unsigned type (and so does l+s-t). But even
if it could be made correct, it's all completely unnecessary and just
making the code slower and less readable. If __fwritex were the right
place for this code, all you would need to
do is check whether i<16 (or whatever threshold) before calling
Post by Rich Felker
f->write, and if so, memcpy'ing it to the buffer then calling f->write
with a length of 0.
I agree, the check for i <= X is a simpler way of expressing the algorithm.
[X is a value between 9 and 16 that guarantees (s + X) will be word aligned]
In my version, I introduced a new array base "t" and rebase i to index into
"t" because "t" is a guaranteed word aligned pointer.
However, this word alignment of "t" is not exploited in the current
byte-at-a-time searching for '\n', so yes at present it's unnecessary.
Post by Rich Felker
However, then you could not use the return value
of f->write to determine if it succeeded (see how fflush and fseek
have to deal with this case). Contrary to what your code assumes,
f->write does not (and cannot, since the type is unsigned) return a
negative value on error.
Ok, noted, the expression to check for error should be (!f->wpos) and not n
< 0
Instead, I think it probably makes more sense to put the logic in
Post by Rich Felker
__stdio_write, but this will also be somewhat nontrivial to work in.
At least the "iovcnt == 2 ? ..." logic needs to be adapted to
something like "rem > len ? ...". Before the loop should probably be
something like "if (len < f->wend-f->wpos && len <= 16) ..." to
conditionally copy the new data into the buffer.
Do you see any reason to prefer doing it in __fwritex?
(a) facilitates word-by-word searching objective outlined above
Whether a faster search for '\n' is possible or makes sense is
completely independent of where or whether you copy into the buffer.
The search is necessary either way, and the scope of the search has to
be the entire input array.

There is a namespace-safe __memrchr function that could be used here
if it were optimized, but I don't think open-coding an optimized
version of a string function inside a stdio function makes sense.
Post by Xan Phung
(b) check for space in buffer is already done in line 10 of __fwrite.c,
hence avoids need for any more buffer length checks
Indeed, that is potentially a reason for doing it here.
Post by Xan Phung
(c) ?? speeds up writes which doesn't use __stdio_write
You mean other backends like fmemopen, open_memstream, fopencookie,
etc.? This is possibly also a good reason, but in order for it to
help, some of their write functions might need improvements too. For
example I just noticed that fopencookie's cookiewrite() function
unnecessarily calls the application's write function a second time
with length 0 when length 0 was passed to it. This should probably be
avoided for other reasons, since it's not clear that the callback
needs to accept 0 as a valid length.

Note that __stdio_write also needs some changes to avoid unnecessary
writev syscalls when write would suffice.

So anyway, I think there are probably some good reasons to do the
merging in __fwritex rather than in the __stdio_write backend. But it
should be a lot simpler than what you initially proposed.

Rich
Xan Phung
2018-12-01 08:02:02 UTC
Permalink
Post by Rich Felker
So anyway, I think there are probably some good reasons to do the
merging in __fwritex rather than in the __stdio_write backend. But it
should be a lot simpler than what you initially proposed.
Rich
Thanks for your feedback! In interest of keeping the proposal as simple as
possible, I've gone with just one extra line of code (see below). It
simply does copying for i <= 8.

It works on my initial testing and I'm getting zero lengths in the 2nd
writev iovec (as expected), but I'll test it more in coming week.

size_t __fwritex(const unsigned char *restrict s, size_t l, FILE *restrict
f)

{

size_t i=0;


if (!f->wend && __towrite(f)) return 0;


if (l > f->wend - f->wpos) return f->write(f, s, l);


if (f->lbf >= 0) {

/* Match /^(.*\n|)/ */

for (i=l; i && s[i-1] != '\n'; i--);

if (i) {

*if (i <= 8) for (; i; i--, l--) *(f->wpos)++ = *s++;*

size_t n = f->write(f, s, i);

if (*!f->wpos ||* n < i) return n;

l -= i;

s += i;

}

}


memcpy(f->wpos, s, l);

f->wpos += l;

return l+i;

}

Loading...