Message ID | 20241015211719.1152862-1-irogers@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2,1/3] proc_pid_fdinfo.5: Reduce indent for most of the page |
On Tue, Oct 15, 2024 at 02:17:17PM -0700, Ian Rogers wrote:
> When /proc/pid/fdinfo was part of proc.5 man page the indentation made
> sense. As a standalone man page the indentation doesn't need to be so
> far over to the right. Remove the initial tagged pragraph and move the
> styling to the initial summary description.
>
> Suggested-by: G. Branden Robinson <g.branden.robinson@gmail.com>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  man/man5/proc_pid_fdinfo.5 | 66 ++++++++++++++++++--------------------
>  1 file changed, 32 insertions(+), 34 deletions(-)
>
> diff --git a/man/man5/proc_pid_fdinfo.5 b/man/man5/proc_pid_fdinfo.5
> index 1e23bbe02..8678caf4a 100644
> --- a/man/man5/proc_pid_fdinfo.5
> +++ b/man/man5/proc_pid_fdinfo.5
> @@ -6,20 +6,19 @@
>  .\"
>  .TH proc_pid_fdinfo 5 (date) "Linux man-pages (unreleased)"
>  .SH NAME
> -/proc/pid/fdinfo/ \- information about file descriptors
> +.IR /proc/ pid /fdinfo " \- information about file descriptors"

I wouldn't add formatting here for now. That's something I prefer to be
cautious about, and if we do it, we should do it in a separate commit.

>  .SH DESCRIPTION
> -.TP
> -.IR /proc/ pid /fdinfo/ " (since Linux 2.6.22)"
> -This is a subdirectory containing one entry for each file which the
> -process has open, named by its file descriptor.
> -The files in this directory are readable only by the owner of the process.
> -The contents of each file can be read to obtain information
> -about the corresponding file descriptor.
> -The content depends on the type of file referred to by the
> -corresponding file descriptor.
> -.IP
> +Since Linux 2.6.22,

You could move this information to a HISTORY section.

> +this subdirectory contains one entry for each file that process
> +.I pid
> +has open, named by its file descriptor. The files in this directory

Please don't reflow existing text. Please read about semantic newlines
in man-pages(7):

$ MANWIDTH=72 man man-pages | sed -n '/Use semantic newlines/,/^$/p'
   Use semantic newlines
       In the source of a manual page, new sentences should be started
       on new lines, long sentences should be split into lines at clause
       breaks (commas, semicolons, colons, and so on), and long clauses
       should be split at phrase boundaries. This convention, sometimes
       known as "semantic newlines", makes it easier to see the effect
       of patches, which often operate at the level of individual sen‐
       tences, clauses, or phrases.

Have a lovely day!
Alex

> +are readable only by the owner of the process. The contents of each
> +file can be read to obtain information about the corresponding file
> +descriptor. The content depends on the type of file referred to by
> +the corresponding file descriptor.
> +.P
>  For regular files and directories, we see something like:
> -.IP
> +.P
>  .in +4n
>  .EX
>  .RB "$" " cat /proc/12015/fdinfo/4"
> @@ -28,7 +27,7 @@ flags: 01002002
>  mnt_id: 21
>  .EE
>  .in
> -.IP
> +.P
>  The fields are as follows:
>  .RS
>  .TP
> @@ -51,7 +50,6 @@ this field incorrectly displayed the setting of
>  at the time the file was opened,
>  rather than the current setting of the close-on-exec flag.
>  .TP
> -.I
>  .I mnt_id
>  This field, present since Linux 3.15,
>  .\" commit 49d063cb353265c3af701bab215ac438ca7df36d
> @@ -59,13 +57,13 @@ is the ID of the mount containing this file.
>  See the description of
>  .IR /proc/ pid /mountinfo .
>  .RE
> -.IP
> +.P
>  For eventfd file descriptors (see
>  .BR eventfd (2)),
>  we see (since Linux 3.8)
>  .\" commit cbac5542d48127b546a23d816380a7926eee1c25
>  the following fields:
> -.IP
> +.P
>  .in +4n
>  .EX
>  pos: 0
> @@ -74,16 +72,16 @@ mnt_id: 10
>  eventfd\-count: 40
>  .EE
>  .in
> -.IP
> +.P
>  .I eventfd\-count
>  is the current value of the eventfd counter, in hexadecimal.
> -.IP
> +.P
>  For epoll file descriptors (see
>  .BR epoll (7)),
>  we see (since Linux 3.8)
>  .\" commit 138d22b58696c506799f8de759804083ff9effae
>  the following fields:
> -.IP
> +.P
>  .in +4n
>  .EX
>  pos: 0
> @@ -93,7 +91,7 @@ tfd: 9 events: 19 data: 74253d2500000009
>  tfd: 7 events: 19 data: 74253d2500000007
>  .EE
>  .in
> -.IP
> +.P
>  Each of the lines beginning
>  .I tfd
>  describes one of the file descriptors being monitored via
> @@ -110,13 +108,13 @@ descriptor.
>  The
>  .I data
>  field is the data value associated with this file descriptor.
> -.IP
> +.P
>  For signalfd file descriptors (see
>  .BR signalfd (2)),
>  we see (since Linux 3.8)
>  .\" commit 138d22b58696c506799f8de759804083ff9effae
>  the following fields:
> -.IP
> +.P
>  .in +4n
>  .EX
>  pos: 0
> @@ -125,7 +123,7 @@ mnt_id: 10
>  sigmask: 0000000000000006
>  .EE
>  .in
> -.IP
> +.P
>  .I sigmask
>  is the hexadecimal mask of signals that are accepted via this
>  signalfd file descriptor.
> @@ -135,12 +133,12 @@ and
>  .BR SIGQUIT ;
>  see
>  .BR signal (7).)
> -.IP
> +.P
>  For inotify file descriptors (see
>  .BR inotify (7)),
>  we see (since Linux 3.8)
>  the following fields:
> -.IP
> +.P
>  .in +4n
>  .EX
>  pos: 0
> @@ -150,7 +148,7 @@ inotify wd:2 ino:7ef82a sdev:800001 mask:800afff ignored_mask:0 fhandle\-bytes:8
>  inotify wd:1 ino:192627 sdev:800001 mask:800afff ignored_mask:0 fhandle\-bytes:8 fhandle\-type:1 f_handle:27261900802dfd73
>  .EE
>  .in
> -.IP
> +.P
>  Each of the lines beginning with "inotify" displays information about
>  one file or directory that is being monitored.
>  The fields in this line are as follows:
> @@ -168,19 +166,19 @@ The ID of the device where the target file resides (in hexadecimal).
>  .I mask
>  The mask of events being monitored for the target file (in hexadecimal).
>  .RE
> -.IP
> +.P
>  If the kernel was built with exportfs support, the path to the target
>  file is exposed as a file handle, via three hexadecimal fields:
>  .IR fhandle\-bytes ,
>  .IR fhandle\-type ,
>  and
>  .IR f_handle .
> -.IP
> +.P
>  For fanotify file descriptors (see
>  .BR fanotify (7)),
>  we see (since Linux 3.8)
>  the following fields:
> -.IP
> +.P
>  .in +4n
>  .EX
>  pos: 0
> @@ -190,7 +188,7 @@ fanotify flags:0 event\-flags:88002
>  fanotify ino:19264f sdev:800001 mflags:0 mask:1 ignored_mask:0 fhandle\-bytes:8 fhandle\-type:1 f_handle:4f261900a82dfd73
>  .EE
>  .in
> -.IP
> +.P
>  The fourth line displays information defined when the fanotify group
>  was created via
>  .BR fanotify_init (2):
> @@ -210,7 +208,7 @@ argument given to
>  .BR fanotify_init (2)
>  (expressed in hexadecimal).
>  .RE
> -.IP
> +.P
>  Each additional line shown in the file contains information
>  about one of the marks in the fanotify group.
>  Most of these fields are as for inotify, except:
> @@ -228,16 +226,16 @@ The events mask for this mark
>  The mask of events that are ignored for this mark
>  (expressed in hexadecimal).
>  .RE
> -.IP
> +.P
>  For details on these fields, see
>  .BR fanotify_mark (2).
> -.IP
> +.P
>  For timerfd file descriptors (see
>  .BR timerfd (2)),
>  we see (since Linux 3.17)
>  .\" commit af9c4957cf212ad9cf0bee34c95cb11de5426e85
>  the following fields:
> -.IP
> +.P
>  .in +4n
>  .EX
>  pos: 0
> --
> 2.47.0.rc1.288.g06298d1525-goog
>
On Fri, Nov 1, 2024 at 6:24 AM Alejandro Colomar <alx@kernel.org> wrote:
>
> On Tue, Oct 15, 2024 at 02:17:17PM -0700, Ian Rogers wrote:
> > When /proc/pid/fdinfo was part of proc.5 man page the indentation made
> > sense. As a standalone man page the indentation doesn't need to be so
> > far over to the right. Remove the initial tagged pragraph and move the
> > styling to the initial summary description.
> >
> > Suggested-by: G. Branden Robinson <g.branden.robinson@gmail.com>
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  man/man5/proc_pid_fdinfo.5 | 66 ++++++++++++++++++--------------------
> >  1 file changed, 32 insertions(+), 34 deletions(-)
> >
> > diff --git a/man/man5/proc_pid_fdinfo.5 b/man/man5/proc_pid_fdinfo.5
> > index 1e23bbe02..8678caf4a 100644
> > --- a/man/man5/proc_pid_fdinfo.5
> > +++ b/man/man5/proc_pid_fdinfo.5
> > @@ -6,20 +6,19 @@
> >  .\"
> >  .TH proc_pid_fdinfo 5 (date) "Linux man-pages (unreleased)"
> >  .SH NAME
> > -/proc/pid/fdinfo/ \- information about file descriptors
> > +.IR /proc/ pid /fdinfo " \- information about file descriptors"
>
> I wouldn't add formatting here for now. That's something I prefer to be
> cautious about, and if we do it, we should do it in a separate commit.

I'll move it to a separate patch. Is the caution due to a lack of test
infrastructure? That could be something to get resolved, perhaps
through Google summer-of-code and the like.

> >  .SH DESCRIPTION
> > -.TP
> > -.IR /proc/ pid /fdinfo/ " (since Linux 2.6.22)"
> > -This is a subdirectory containing one entry for each file which the
> > -process has open, named by its file descriptor.
> > -The files in this directory are readable only by the owner of the process.
> > -The contents of each file can be read to obtain information
> > -about the corresponding file descriptor.
> > -The content depends on the type of file referred to by the
> > -corresponding file descriptor.
> > -.IP
> > +Since Linux 2.6.22,
>
> You could move this information to a HISTORY section.

Sure, tbh I'm not sure anybody cares about this information and it
could be as well to delete it. Sorry people running 17 year old
kernels. For now I'll try to leave it unchanged.

> > +this subdirectory contains one entry for each file that process
> > +.I pid
> > +has open, named by its file descriptor. The files in this directory
>
> Please don't reflow existing text. Please read about semantic newlines
> in man-pages(7):
>
> $ MANWIDTH=72 man man-pages | sed -n '/Use semantic newlines/,/^$/p'
>    Use semantic newlines
>        In the source of a manual page, new sentences should be started
>        on new lines, long sentences should be split into lines at clause
>        breaks (commas, semicolons, colons, and so on), and long clauses
>        should be split at phrase boundaries. This convention, sometimes
>        known as "semantic newlines", makes it easier to see the effect
>        of patches, which often operate at the level of individual sen‐
>        tences, clauses, or phrases.

I'll update for v3 but I'm reminded of `git diff --word-diff=color` so
perhaps this recommendation is outdated.

Thanks,
Ian

> Have a lovely day!
> Alex
>
> > +are readable only by the owner of the process. The contents of each
> > +file can be read to obtain information about the corresponding file
> > +descriptor. The content depends on the type of file referred to by
> > +the corresponding file descriptor.
> > +.P
> >  For regular files and directories, we see something like:
> > -.IP
> > +.P
> >  .in +4n
> >  .EX
> >  .RB "$" " cat /proc/12015/fdinfo/4"
> > @@ -28,7 +27,7 @@ flags: 01002002
> >  mnt_id: 21
> >  .EE
> >  .in
> > -.IP
> > +.P
> >  The fields are as follows:
> >  .RS
> >  .TP
> > @@ -51,7 +50,6 @@ this field incorrectly displayed the setting of
> >  at the time the file was opened,
> >  rather than the current setting of the close-on-exec flag.
> >  .TP
> > -.I
> >  .I mnt_id
> >  This field, present since Linux 3.15,
> >  .\" commit 49d063cb353265c3af701bab215ac438ca7df36d
> > @@ -59,13 +57,13 @@ is the ID of the mount containing this file.
> >  See the description of
> >  .IR /proc/ pid /mountinfo .
> >  .RE
> > -.IP
> > +.P
> >  For eventfd file descriptors (see
> >  .BR eventfd (2)),
> >  we see (since Linux 3.8)
> >  .\" commit cbac5542d48127b546a23d816380a7926eee1c25
> >  the following fields:
> > -.IP
> > +.P
> >  .in +4n
> >  .EX
> >  pos: 0
> > @@ -74,16 +72,16 @@ mnt_id: 10
> >  eventfd\-count: 40
> >  .EE
> >  .in
> > -.IP
> > +.P
> >  .I eventfd\-count
> >  is the current value of the eventfd counter, in hexadecimal.
> > -.IP
> > +.P
> >  For epoll file descriptors (see
> >  .BR epoll (7)),
> >  we see (since Linux 3.8)
> >  .\" commit 138d22b58696c506799f8de759804083ff9effae
> >  the following fields:
> > -.IP
> > +.P
> >  .in +4n
> >  .EX
> >  pos: 0
> > @@ -93,7 +91,7 @@ tfd: 9 events: 19 data: 74253d2500000009
> >  tfd: 7 events: 19 data: 74253d2500000007
> >  .EE
> >  .in
> > -.IP
> > +.P
> >  Each of the lines beginning
> >  .I tfd
> >  describes one of the file descriptors being monitored via
> > @@ -110,13 +108,13 @@ descriptor.
> >  The
> >  .I data
> >  field is the data value associated with this file descriptor.
> > -.IP
> > +.P
> >  For signalfd file descriptors (see
> >  .BR signalfd (2)),
> >  we see (since Linux 3.8)
> >  .\" commit 138d22b58696c506799f8de759804083ff9effae
> >  the following fields:
> > -.IP
> > +.P
> >  .in +4n
> >  .EX
> >  pos: 0
> > @@ -125,7 +123,7 @@ mnt_id: 10
> >  sigmask: 0000000000000006
> >  .EE
> >  .in
> > -.IP
> > +.P
> >  .I sigmask
> >  is the hexadecimal mask of signals that are accepted via this
> >  signalfd file descriptor.
> > @@ -135,12 +133,12 @@ and
> >  .BR SIGQUIT ;
> >  see
> >  .BR signal (7).)
> > -.IP
> > +.P
> >  For inotify file descriptors (see
> >  .BR inotify (7)),
> >  we see (since Linux 3.8)
> >  the following fields:
> > -.IP
> > +.P
> >  .in +4n
> >  .EX
> >  pos: 0
> > @@ -150,7 +148,7 @@ inotify wd:2 ino:7ef82a sdev:800001 mask:800afff ignored_mask:0 fhandle\-bytes:8
> >  inotify wd:1 ino:192627 sdev:800001 mask:800afff ignored_mask:0 fhandle\-bytes:8 fhandle\-type:1 f_handle:27261900802dfd73
> >  .EE
> >  .in
> > -.IP
> > +.P
> >  Each of the lines beginning with "inotify" displays information about
> >  one file or directory that is being monitored.
> >  The fields in this line are as follows:
> > @@ -168,19 +166,19 @@ The ID of the device where the target file resides (in hexadecimal).
> >  .I mask
> >  The mask of events being monitored for the target file (in hexadecimal).
> >  .RE
> > -.IP
> > +.P
> >  If the kernel was built with exportfs support, the path to the target
> >  file is exposed as a file handle, via three hexadecimal fields:
> >  .IR fhandle\-bytes ,
> >  .IR fhandle\-type ,
> >  and
> >  .IR f_handle .
> > -.IP
> > +.P
> >  For fanotify file descriptors (see
> >  .BR fanotify (7)),
> >  we see (since Linux 3.8)
> >  the following fields:
> > -.IP
> > +.P
> >  .in +4n
> >  .EX
> >  pos: 0
> > @@ -190,7 +188,7 @@ fanotify flags:0 event\-flags:88002
> >  fanotify ino:19264f sdev:800001 mflags:0 mask:1 ignored_mask:0 fhandle\-bytes:8 fhandle\-type:1 f_handle:4f261900a82dfd73
> >  .EE
> >  .in
> > -.IP
> > +.P
> >  The fourth line displays information defined when the fanotify group
> >  was created via
> >  .BR fanotify_init (2):
> > @@ -210,7 +208,7 @@ argument given to
> >  .BR fanotify_init (2)
> >  (expressed in hexadecimal).
> >  .RE
> > -.IP
> > +.P
> >  Each additional line shown in the file contains information
> >  about one of the marks in the fanotify group.
> >  Most of these fields are as for inotify, except:
> > @@ -228,16 +226,16 @@ The events mask for this mark
> >  The mask of events that are ignored for this mark
> >  (expressed in hexadecimal).
> >  .RE
> > -.IP
> > +.P
> >  For details on these fields, see
> >  .BR fanotify_mark (2).
> > -.IP
> > +.P
> >  For timerfd file descriptors (see
> >  .BR timerfd (2)),
> >  we see (since Linux 3.17)
> >  .\" commit af9c4957cf212ad9cf0bee34c95cb11de5426e85
> >  the following fields:
> > -.IP
> > +.P
> >  .in +4n
> >  .EX
> >  pos: 0
> > --
> > 2.47.0.rc1.288.g06298d1525-goog
> >
>
> --
> <https://www.alejandro-colomar.es/>
Hi Ian,

On Fri, Nov 01, 2024 at 11:19:18AM -0700, Ian Rogers wrote:
> On Fri, Nov 1, 2024 at 6:24 AM Alejandro Colomar <alx@kernel.org> wrote:
> >
> > On Tue, Oct 15, 2024 at 02:17:17PM -0700, Ian Rogers wrote:
> > > When /proc/pid/fdinfo was part of proc.5 man page the indentation made
> > > sense. As a standalone man page the indentation doesn't need to be so
> > > far over to the right. Remove the initial tagged pragraph and move the
> > > styling to the initial summary description.
> > >
> > > Suggested-by: G. Branden Robinson <g.branden.robinson@gmail.com>
> > > Signed-off-by: Ian Rogers <irogers@google.com>
> > > ---
> > >  man/man5/proc_pid_fdinfo.5 | 66 ++++++++++++++++++--------------------
> > >  1 file changed, 32 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/man/man5/proc_pid_fdinfo.5 b/man/man5/proc_pid_fdinfo.5
> > > index 1e23bbe02..8678caf4a 100644
> > > --- a/man/man5/proc_pid_fdinfo.5
> > > +++ b/man/man5/proc_pid_fdinfo.5
> > > @@ -6,20 +6,19 @@
> > >  .\"
> > >  .TH proc_pid_fdinfo 5 (date) "Linux man-pages (unreleased)"
> > >  .SH NAME
> > > -/proc/pid/fdinfo/ \- information about file descriptors
> > > +.IR /proc/ pid /fdinfo " \- information about file descriptors"
> >
> > I wouldn't add formatting here for now. That's something I prefer to be
> > cautious about, and if we do it, we should do it in a separate commit.
>
> I'll move it to a separate patch. Is the caution due to a lack of test
> infrastructure? That could be something to get resolved, perhaps
> through Google summer-of-code and the like.

That change might be controversial.
We'd first need to check that all software that reads the NAME section
would behave well for this.
Also, many other pages might need to be changed accordingly for
consistency.

For testing infrastructure I think we're good. The makefile already
does a lot of testing.

>
> > >  .SH DESCRIPTION
> > > -.TP
> > > -.IR /proc/ pid /fdinfo/ " (since Linux 2.6.22)"
> > > -This is a subdirectory containing one entry for each file which the
> > > -process has open, named by its file descriptor.
> > > -The files in this directory are readable only by the owner of the process.
> > > -The contents of each file can be read to obtain information
> > > -about the corresponding file descriptor.
> > > -The content depends on the type of file referred to by the
> > > -corresponding file descriptor.
> > > -.IP
> > > +Since Linux 2.6.22,
> >
> > You could move this information to a HISTORY section.
>
> Sure, tbh I'm not sure anybody cares about this information and it
> could be as well to delete it. Sorry people running 17 year old
> kernels. For now I'll try to leave it unchanged.

I would like to keep it in HISTORY. You never know when it'll be useful
and it's just one line or a few; it won't hurt.

>
> > > +this subdirectory contains one entry for each file that process
> > > +.I pid
> > > +has open, named by its file descriptor. The files in this directory
> >
> > Please don't reflow existing text. Please read about semantic newlines
> > in man-pages(7):
> >
> > $ MANWIDTH=72 man man-pages | sed -n '/Use semantic newlines/,/^$/p'
> >    Use semantic newlines
> >        In the source of a manual page, new sentences should be started
> >        on new lines, long sentences should be split into lines at clause
> >        breaks (commas, semicolons, colons, and so on), and long clauses
> >        should be split at phrase boundaries. This convention, sometimes
> >        known as "semantic newlines", makes it easier to see the effect
> >        of patches, which often operate at the level of individual sen‐
> >        tences, clauses, or phrases.
>
> I'll update for v3 but I'm reminded of `git diff --word-diff=color` so
> perhaps this recommendation is outdated.

No, this isn't outdated, since that reduces the quality of the diff.
Also, I review a lot of patches in the mail client, without running
git(1). And it's not just for reviewing diffs, but also for writing
them. Semantic newlines reduce the amount of work for producing the
diffs.

And lastly, the source code reads much better if it's logically divided
in phrases.

>
> Thanks,
> Ian
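The point about diff quality can be checked with a small experiment. The sketch below is only an illustration: the paragraph text is a made-up sample (not the real page source), and the "filled" variant was re-wrapped by hand the way an editor's fill command might do it. It adds the same three words to both layouts and counts how many lines diff(1) reports as touched.

```shell
#!/bin/sh
# Count the lines a diff touches when one phrase is added to a paragraph.
# The paragraph text is an illustrative sample, not the real page source.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

changed_lines() {
	# In diff's default output, touched lines start with '<' or '>'.
	diff "$1" "$2" | grep -c '^[<>]'
}

# Filled layout: sentences run together and wrap at a fixed width.
cat >"$tmp/filled.old" <<'EOF'
Each entry is named by its file descriptor. The files are readable
only by the owner, and their content depends on the file type. The
fields are described below.
EOF
# After adding "in this directory", the paragraph is re-filled by hand.
cat >"$tmp/filled.new" <<'EOF'
Each entry in this directory is named by its file descriptor. The
files are readable only by the owner, and their content depends on
the file type. The fields are described below.
EOF

# Semantic newlines: one sentence or clause per source line.
cat >"$tmp/semantic.old" <<'EOF'
Each entry is named by its file descriptor.
The files are readable only by the owner,
and their content depends on the file type.
The fields are described below.
EOF
cat >"$tmp/semantic.new" <<'EOF'
Each entry in this directory is named by its file descriptor.
The files are readable only by the owner,
and their content depends on the file type.
The fields are described below.
EOF

filled=$(changed_lines "$tmp/filled.old" "$tmp/filled.new")
semantic=$(changed_lines "$tmp/semantic.old" "$tmp/semantic.new")
echo "filled layout:   $filled lines touched"
echo "semantic layout: $semantic lines touched"
```

With this sample the filled layout reports every line of the paragraph as changed, while the semantic layout reports only the one edited sentence, which is the effect a reviewer reading a patch in a mail client would see.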
[adding Colin Watson to CC; and the groff list because I started musing]

Hi Alex,

At 2024-11-01T21:07:29+0100, Alejandro Colomar wrote:
> > > > -/proc/pid/fdinfo/ \- information about file descriptors
> > > > +.IR /proc/ pid /fdinfo " \- information about file descriptors"
> > >
> > > I wouldn't add formatting here for now. That's something I prefer
> > > to be cautious about, and if we do it, we should do it in a
> > > separate commit.
> >
> > I'll move it to a separate patch. Is the caution due to a lack of
> > test infrastructure? That could be something to get resolved,
> > perhaps through Google summer-of-code and the like.
>
> That change might be controversial.

Then let those with objections step forward and make them!

(I may be one of them; see below.)

> We'd first need to check that all software that reads the NAME section
> would behave well for this.

Not _all_ software, surely. Anybody can write a craptastic man(7)
scraper, and several have, mainly back when Web 1.0 was going to eat the
world. Most of those have withered on the vine.

This is the _Linux_ man-pages project, so what matters are (1) man page
formatters and (2) man page indexers that GNU/Linux systems actually
use. Where people get nervous with the "NAME" section is because of the
indexer; if one's man(7) _formatter_ can't handle an `IR` call, it
hasn't earned the name.

Here's a sample input.

$ cat /tmp/proc_pid_fdinfo_mini.5
.TH proc_pid_fdinfo_mini 5 2024-11-02 "example"
.SH Name
.IR /proc/ pid /fdinfo " \- information about file descriptors"
.SH Description
Text text text text.

Starting with formatters, let's see how they do.

$ nroff -man /tmp/proc_pid_fdinfo_mini.5
proc_pid_fdinfo_mini(5)    File Formats Manual    proc_pid_fdinfo_mini(5)

Name
       /proc/pid/fdinfo - information about file descriptors

Description
       Text text text text.

example    2024‐11‐02    proc_pid_fdinfo_mini(5)
$ mandoc /tmp/proc_pid_fdinfo_mini.5 | ul
proc_pid_fdinfo_mini(5)    File Formats Manual    proc_pid_fdinfo_mini(5)

Name
       /proc/pid/fdinfo - information about file descriptors

Description
       Text text text text.

example    2024-11-02    proc_pid_fdinfo_mini(5)
$ ~/heirloom/bin/nroff -man /tmp/proc_pid_fdinfo_mini.5 | ul
proc_pid_fdinfo_mini(5)    File Formats Manual    proc_pid_fdinfo_mini(5)



Name
       /proc/pid/fdinfo - information about file descriptors

Description
       Text text text text.



example    2024-11-02    proc_pid_fdinfo_mini(5)
$ DWBHOME=~/dwb ~/dwb/bin/nroff -man /tmp/proc_pid_fdinfo_mini.5 | cat -s | ul

proc_pid_fdinfo_mini(5)example (2024-11-02)roc_pid_fdinfo_mini(5)

Name
       /proc/pid/fdinfo - information about file descriptors

Description
       Text text text text.

Page 1 (printed 11/2/2024)

I leave the execution of these to perceive the correct font style
changes as an exercise for the reader, but they all get the
"/proc/pid/fdinfo" line right.

On GNU/Linux systems, the only man page indexer I know of is Colin
Watson's man-db--specifically, its mandb(8) program. But it's nicely
designed so that the "topic and summary description extraction" task is
delegated to a standalone tool, lexgrog(1), and we can use that.

$ lexgrog /tmp/proc_pid_fdinfo_mini.5
/tmp/proc_pid_fdinfo_mini.5: parse failed

Oh, damn. I wasn't expecting that. Maybe this is what defeats Michael
Kerrisk's scraper with respect to groff's man pages.[1]

Well, I can find a silver lining here, because it gives me an even
better reason than I had to pitch an idea I've been kicking around for a
while. Why not enhance groff man(7) to support a mode where _it_ will
spit out the "Name"/"NAME" section, and only that, _for_ you?

This would be as easy as checking for an option, say '-d EXTRACT=Name',
and having the package's "TH" and "SH" macro definitions divert
(literally, with the `di` request) everything _except_ the section of
interest to a diversion that is then never called/output. (This is
similar to an m4 feature known as the "black hole diversion".)

All of the features necessary to implement this[2] were part of troff as
far as back as the birth of the man(7) package itself. It's not clear
to me why it wasn't done back in the 1980s.

lexgrog(1) itself will of course have to stay around for years to come,
but this could take a significant distraction off of Colin's plate--I
believe I have seen him grumble about how much *roff syntax he has to
parse to have the feature be workable, and that's without upstart groff
maintainers exploring up to every boundary that existed even in 1979 and
cheerfully exercising their findings in man pages.

I also of course have ideas for generalizing the feature, so that you
can request any (sub)section by name, and, with a bit more ambition,[4]
paragraph tags (`TP`) too.

So you could do things like:

nroff -man -d EXTRACT="RETURN VALUE" man3/bsearch.3

and:

nroff -man -d EXTRACT="OPTIONS/-b" man8/zic.8

...does this sound appetizing to anyone?

> Also, many other pages might need to be changed accordingly for
> consistency.

I withdraw the suggestion until lexgrog(1) flexes its own muscles, or
has groff(1) do the lifting. I'm sorry for prompting churn, Ian.

> No, this isn't outdated, since that reduces the quality of the diff.
> Also, I review a lot of patches in the mail client, without running
> git(1). And it's not just for reviewing diffs, but also for writing
> them. Semantic newlines reduce the amount of work for producing the
> diffs.

It's a real win for diffs.

Here's a very recent example from groff.

diff --git a/man/groff.7.man b/man/groff.7.man
index 1fb635f2b..1d248b237 100644
--- a/man/groff.7.man
+++ b/man/groff.7.man
@@ -1281,6 +1281,7 @@ .SH Identifiers
 typeface,
 color,
 special character or character class,
+hyphenation language code,
 environment,
 or stream.
 .

(So recent that in fact I haven't pushed that yet.)

Lists like the foregoing are common in man pages.

Regards,
Branden

[1] https://man7.org/linux/man-pages/dir_by_project.html#groff
[2] String definitions, "string comparisons"[3], and diversions.
[3] strictly, "formatted output comparisons"

    https://www.gnu.org/software/groff/manual/groff.html.node/Operators-in-Conditionals.html

    You can do stricter string comparisons in GNU troff. And I've
    thought of some syntactic sugar for performing them that wouldn't
    break backward compatibility.
[4] To really land the feature, we need automatic tag generation from
    input text (we don't want to make the man page author construct
    their own tags). Another reason we want the construction to be
    automatic is to make the tags unique when multiple man pages are
    formatted in one run, as one might do when making a book of man
    pages.

    Automatic tagging will also enable the slaying of two other ancient
    dragons.

    1. deep internal links for PDF bookmarks
    2. pod2man's `IX`-happy output; the widespread use of this
       nonstandard macro confuses way too many novice page authors, and
       bloats document size.

    Another feature we'll really want to do this right is improved
    string processing facilities. That, too, is something that will pay
    dividends in several areas. With a proper string iterator in the
    formatter (and a couple more conditional operators),[5] it will be
    possible to write a string library as a macro file, slimming down
    the formatter itself a little and making macro writers' lives
    easier. We're only two days into the month and this has already
    come up on the groff list.

    https://lists.gnu.org/archive/html/groff/2024-11/msg00002.html
[5] https://savannah.gnu.org/bugs/?62264
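The '-d EXTRACT=Name' mode described above is a proposal in this thread, not an existing groff option. As a rough stand-in for what such a mode would return, the sketch below pulls one section out of the raw man(7) source with sed(1). The function name extract_section is hypothetical, and the approach is an assumption chosen for illustration: it matches the literal text after `.SH`, does not interpret any roff macros, and would miss titles written in quotes or containing regex metacharacters.

```shell
#!/bin/sh
# Print the body of one top-level section of a man(7) source file.
# Works on raw source text only; roff macros are not interpreted.
# extract_section is a hypothetical helper, not part of groff or man-db.
extract_section() {
	# $1 = section title exactly as written after .SH, $2 = file
	sed -n \
		-e "/^\.SH $1\$/,/^\.SH /{" \
		-e '/^\.SH /!p' \
		-e '}' "$2"
}

# Sample input copied from the message above.
page=$(mktemp)
trap 'rm -f "$page"' EXIT
cat >"$page" <<'EOF'
.TH proc_pid_fdinfo_mini 5 2024-11-02 "example"
.SH Name
.IR /proc/ pid /fdinfo " \- information about file descriptors"
.SH Description
Text text text text.
EOF

name=$(extract_section Name "$page")
printf '%s\n' "$name"
```

The output is the unformatted `.IR` line, so a caller would still have to format it (for example by piping it through nroff -man together with the `.TH` line). Doing the selection inside the macro package, as proposed, would avoid that second step.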
Hi Branden,

On Sat, Nov 02, 2024 at 05:08:37AM -0500, G. Branden Robinson wrote:
> [adding Colin Watson to CC; and the groff list because I started musing]
>
> Hi Alex,
>
> At 2024-11-01T21:07:29+0100, Alejandro Colomar wrote:
> > > > > -/proc/pid/fdinfo/ \- information about file descriptors
> > > > > +.IR /proc/ pid /fdinfo " \- information about file descriptors"
> > > >
> > > > I wouldn't add formatting here for now. That's something I prefer
> > > > to be cautious about, and if we do it, we should do it in a
> > > > separate commit.
> > >
> > > I'll move it to a separate patch. Is the caution due to a lack of
> > > test infrastructure? That could be something to get resolved,
> > > perhaps through Google summer-of-code and the like.
> >
> > That change might be controversial.
>
> Then let those with objections step forward and make them!

Sure! But that in itself (and the length of your mail) makes a strong
reason to have this in a separate commit. :)

I'm not opposed to the change. Only cautious.

>
> (I may be one of them; see below.)
>
> > We'd first need to check that all software that reads the NAME section
> > would behave well for this.
>
> Not _all_ software, surely. Anybody can write a craptastic man(7)
> scraper, and several have, mainly back when Web 1.0 was going to eat the
> world. Most of those have withered on the vine.

Ahh, yeah, I committed the same mistake I criticise in others every now
and then. $all does not really mean "all". (-Wall, `make all`, ...)
I meant all [of which I care], which is basically groff(1) and
mandoc(1). :)

> This is the _Linux_ man-pages project, so what matters are (1) man page
> formatters and (2) man page indexers that GNU/Linux systems actually
> use. Where people get nervous with the "NAME" section is because of the
> indexer; if one's man(7) _formatter_ can't handle an `IR` call, it
> hasn't earned the name.

Yup.

>
> Here's a sample input.
>
> $ cat /tmp/proc_pid_fdinfo_mini.5
> .TH proc_pid_fdinfo_mini 5 2024-11-02 "example"
> .SH Name
> .IR /proc/ pid /fdinfo " \- information about file descriptors"
> .SH Description
> Text text text text.
>
> Starting with formatters, let's see how they do.
>
> $ nroff -man /tmp/proc_pid_fdinfo_mini.5
> proc_pid_fdinfo_mini(5)    File Formats Manual    proc_pid_fdinfo_mini(5)
>
> Name
>        /proc/pid/fdinfo - information about file descriptors
>
> Description
>        Text text text text.
>
> example    2024‐11‐02    proc_pid_fdinfo_mini(5)
> $ mandoc /tmp/proc_pid_fdinfo_mini.5 | ul
> proc_pid_fdinfo_mini(5)    File Formats Manual    proc_pid_fdinfo_mini(5)
>
> Name
>        /proc/pid/fdinfo - information about file descriptors
>
> Description
>        Text text text text.
>
> example    2024-11-02    proc_pid_fdinfo_mini(5)
> $ ~/heirloom/bin/nroff -man /tmp/proc_pid_fdinfo_mini.5 | ul
> proc_pid_fdinfo_mini(5)    File Formats Manual    proc_pid_fdinfo_mini(5)
>
>
>
> Name
>        /proc/pid/fdinfo - information about file descriptors
>
> Description
>        Text text text text.
>
>
>
> example    2024-11-02    proc_pid_fdinfo_mini(5)
> $ DWBHOME=~/dwb ~/dwb/bin/nroff -man /tmp/proc_pid_fdinfo_mini.5 | cat -s | ul
>
> proc_pid_fdinfo_mini(5)example (2024-11-02)roc_pid_fdinfo_mini(5)
>
> Name
>        /proc/pid/fdinfo - information about file descriptors
>
> Description
>        Text text text text.
>
> Page 1 (printed 11/2/2024)
>
> I leave the execution of these to perceive the correct font style
> changes as an exercise for the reader, but they all get the
> "/proc/pid/fdinfo" line right.
>
> On GNU/Linux systems, the only man page indexer I know of is Colin
> Watson's man-db--specifically, its mandb(8) program. But it's nicely
> designed so that the "topic and summary description extraction" task is
> delegated to a standalone tool, lexgrog(1), and we can use that.
>
> $ lexgrog /tmp/proc_pid_fdinfo_mini.5
> /tmp/proc_pid_fdinfo_mini.5: parse failed
>
> Oh, damn. I wasn't expecting that. Maybe this is what defeats Michael
> Kerrisk's scraper with respect to groff's man pages.[1]
>
> Well, I can find a silver lining here, because it gives me an even
> better reason than I had to pitch an idea I've been kicking around for a
> while. Why not enhance groff man(7) to support a mode where _it_ will
> spit out the "Name"/"NAME" section, and only that, _for_ you?
>
> This would be as easy as checking for an option, say '-d EXTRACT=Name',
> and having the package's "TH" and "SH" macro definitions divert
> (literally, with the `di` request) everything _except_ the section of
> interest to a diversion that is then never called/output. (This is
> similar to an m4 feature known as the "black hole diversion".)

Sounds good. And then lexgrog(1) would be a one-liner that calls
groff(1) with the appropriate flag, right?

> All of the features necessary to implement this[2] were part of troff as
> far as back as the birth of the man(7) package itself. It's not clear
> to me why it wasn't done back in the 1980s.

Not enough energy of activation, probably, as with most stuff.

> lexgrog(1) itself will of course have to stay around for years to come,

You can make it a wrapper around groff(1) with flags, no?

> but this could take a significant distraction off of Colin's plate--I
> believe I have seen him grumble about how much *roff syntax he has to
> parse to have the feature be workable, and that's without upstart groff
> maintainers exploring up to every boundary that existed even in 1979 and
> cheerfully exercising their findings in man pages.
>
> I also of course have ideas for generalizing the feature, so that you
> can request any (sub)section by name, and, with a bit more ambition,[4]
> paragraph tags (`TP`) too.
>
> So you could do things like:
>
> nroff -man -d EXTRACT="RETURN VALUE" man3/bsearch.3

I certainly use this.

# man_section() prints specific manual page sections (DESCRIPTION, SYNOPSIS,
# ...) of all manual pages in a directory (or in a single manual page file).
# Usage example: .../man-pages$ man_section man2 SYNOPSIS 'SEE ALSO';

man_section()
{
	if [ $# -lt 2 ]; then
		>&2 echo "Usage: ${FUNCNAME[0]} <dir> <section>...";
		return $EX_USAGE;
	fi

	local page="$1";
	shift;
	local sect="$*";

	find "$page" -type f \
	|xargs wc -l \
	|grep -v -e '\b1 ' -e '\btotal\b' \
	|awk '{ print $2 }' \
	|sort \
	|while read -r manpage; do
		(sed -n '/^\.TH/,/^\.SH/{/^\.SH/!p}' <"$manpage";
		 for s in $sect; do
			<"$manpage" \
			sed -n \
				-e "/^\.SH $s/p" \
				-e "/^\.SH $s/,/^\.SH/{/^\.SH/!p}";
		 done;) \
		|mandoc -Tutf8 2>/dev/null \
		|col -pbx;
	done;
}

# man_lsfunc() prints the name of all C functions declared in the SYNOPSIS
# of all manual pages in a directory (or in a single manual page file).
# Each name is printed in a separate line
# Usage example: .../man-pages$ man_lsfunc man2;

man_lsfunc()
{
	if [ $# -lt 1 ]; then
		>&2 echo "Usage: ${FUNCNAME[0]} <manpage|manNdir>...";
		return $EX_USAGE;
	fi

	for arg in "$@"; do
		man_section "$arg" 'SYNOPSIS';
	done \
	|sed_rm_ccomments \
	|pcregrep -Mn '(?s)^ [\w ]+ \**\w+\([\w\s(,)[\]*]*?(...)?\s*\); *$' \
	|grep '^[0-9]' \
	|sed -E 's/syscall\(SYS_(\w*),?/\1(/' \
	|sed -E 's/^[^(]+ \**(\w+)\(.*/\1/' \
	|uniq;
}

# man_lsvar() prints the name of all C variables declared in the SYNOPSIS
# of all manual pages in a directory (or in a single manual page file).
# Each name is printed in a separate line
# Usage example: .../man-pages$ man_lsvar man3;

man_lsvar()
{
	if [ $# -lt 1 ]; then
		>&2 echo "Usage: ${FUNCNAME[0]} <manpage|manNdir>...";
		return $EX_USAGE;
	fi

	for arg in "$@"; do
		man_section "$arg" 'SYNOPSIS';
	done \
	|sed_rm_ccomments \
	|pcregrep -Mv '(?s)^ [\w ]+ \**\w+\([\w\s(,)[\]*]+?(...)?\s*\); *$' \
	|pcregrep -Mn \
		-e '(?s)^ +extern [\w ]+ \**\(\*+[\w ]+\)\([\w\s(,)[\]*]+?\s*\); *$' \
		-e '^ +extern [\w ]+ \**[\w ]+; *$' \
	|grep '^[0-9]' \
	|grep -v 'typedef' \
	|sed -E 's/^[0-9]+: +extern [^(]+ \**\(\*+(\w* )?(\w+)\)\(.*/\2/' \
	|sed 's/^[0-9]\+: \+extern .* \**\(\w\+\); */\1/' \
	|uniq;
}

Even grepc(1) derived from those scripts.

>
> and:
>
> nroff -man -d EXTRACT="OPTIONS/-b" man8/zic.8

While I haven't used this yet, it's probably because it's quite complex
to implement with regexes, not because it wouldn't be useful.

>
> ...does this sound appetizing to anyone?

Certainly.

> > Also, many other pages might need to be changed accordingly for
> > consistency.
>
> I withdraw the suggestion until lexgrog(1) flexes its own muscles, or
> has groff(1) do the lifting. I'm sorry for prompting churn, Ian.
>
> > No, this isn't outdated, since that reduces the quality of the diff.
> > Also, I review a lot of patches in the mail client, without running
> > git(1). And it's not just for reviewing diffs, but also for writing
> > them. Semantic newlines reduce the amount of work for producing the
> > diffs.
>
> It's a real win for diffs.

And diffs are a real win for text. Thus, semantic newlines are a real
win for text. "Write poems, not prose." (Any chance we may get that
warning added to groff(1)? :D)

Cheers,
Alex

>
> Here's a very recent example from groff.
> > diff --git a/man/groff.7.man b/man/groff.7.man > index 1fb635f2b..1d248b237 100644 > --- a/man/groff.7.man > +++ b/man/groff.7.man > @@ -1281,6 +1281,7 @@ .SH Identifiers > typeface, > color, > special character or character class, > +hyphenation language code, > environment, > or stream. > . > > > (So recent that in fact I haven't pushed that yet.) > > Lists like the foregoing are common in man pages. > > Regards, > Branden > > [1] https://man7.org/linux/man-pages/dir_by_project.html#groff > [2] String definitions, "string comparisons"[3], and diversions. > [3] strictly, "formatted output comparisons" > > https://www.gnu.org/software/groff/manual/groff.html.node/Operators-in-Conditionals.html > > You can do stricter string comparisons in GNU troff. And I've > thought of some syntactic sugar for performing them that wouldn't > break backward compatibility. > > [4] To really land the feature, we need automatic tag generation from > input text (we don't want to make the man page author construct > their own tags). Another reason we want the construction to be > automatic is to make the tags unique when multiple man pages are > formatted in one run, as one might do when making a book of man > pages. Automatic tagging will also enable the slaying of two other > ancient dragons. > > 1. deep internal links for PDF bookmarks > 2. pod2man's `IX`-happy output; the widespread use of this > nonstandard macro confuses way too many novice page authors, and > bloats document size. > > Another feature we'll really want to do this right is improved string > processing facilities. That, too, is something that will pay > dividends in several areas. With a proper string iterator in the > formatter (and a couple more conditional operators),[5] it will be > possible to write a string library as a macro file, slimming down the > formatter itself a little and making macro writers' lives easier. > We're only two days into the month and this has already come up on > the groff list. 
> > https://lists.gnu.org/archive/html/groff/2024-11/msg00002.html > > [5] https://savannah.gnu.org/bugs/?62264
On Sat, Nov 02, 2024 at 05:08:37AM -0500, G. Branden Robinson wrote: > On GNU/Linux systems, the only man page indexer I know of is Colin > Watson's man-db--specifically, its mandb(8) program. But it's nicely > designed so that the "topic and summary description extraction" task is > delegated to a standalone tool, lexgrog(1), and we can use that. > > $ lexgrog /tmp/proc_pid_fdinfo_mini.5 > /tmp/proc_pid_fdinfo_mini.5: parse failed > > Oh, damn. I wasn't expecting that. Maybe this is what defeats Michael > Kerrisk's scraper with respect to groff's man pages.[1] How embarrassing. Could somebody please file a bug on https://gitlab.com/man-db/man-db/-/issues to remind me to fix that? (Of course there'll be a lead time for fixes to get into distributions.) > Well, I can find a silver lining here, because it gives me an even > better reason than I had to pitch an idea I've been kicking around for a > while. Why not enhance groff man(7) to support a mode where _it_ will > spit out the "Name"/"NAME" section, and only that, _for_ you? > > This would be as easy as checking for an option, say '-d EXTRACT=Name', > and having the package's "TH" and "SH" macro definitions divert > (literally, with the `di` request) everything _except_ the section of > interest to a diversion that is then never called/output. (This is > similar to an m4 feature known as the "black hole diversion".) > > All of the features necessary to implement this[2] were part of troff as > far as back as the birth of the man(7) package itself. It's not clear > to me why it wasn't done back in the 1980s. 
> > lexgrog(1) itself will of course have to stay around for years to come, > but this could take a significant distraction off of Colin's plate--I > believe I have seen him grumble about how much *roff syntax he has to > parse to have the feature be workable, and that's without upstart groff > maintainers exploring up to every boundary that existed even in 1979 and > cheerfully exercising their findings in man pages. lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if you focus on that then you'll end up with a design that's not very useful. What really matters is indexing the whole system's manual pages, and mandb(8) does not do that by invoking lexgrog(1) one page at a time, but rather by running more or less the same code in-process. I already know that getting acceptable performance for this requires care, as illustrated by one of the NEWS entries for man-db 2.10.0: * Significantly improve `mandb(8)` and `man -K` performance in the common case where pages are of moderate size and compressed using `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test system. ... so I'm prepared to bet that forking nroff one page at a time will be unacceptably slow. (This also combines with the fact that man-db applies some sandboxing when it's calling nroff just in case it might happen that a moderately-sized C++ project has less than 100% perfect security when doing text processing, which I'm sure everyone agrees would never happen.) If it were possible to run nroff over a whole batch of pages and get output for each of them in one go, then maaaaybe. man-db would need a reliable way to associate each line (or sometimes multiple lines) of output with each source file, and of course care would be needed around error handling and so on. 
I can see the appeal, in terms of processing the actual language rather than a pile of hacks that try to guess what to do with it - but on the other hand this starts to feel like a much less natural fit for the way nroff is run in every other situation, where you're processing one document at a time. Cheers,
Hi Branden, Colin,

On Sat, Nov 02, 2024 at 11:40:13AM +0100, Alejandro Colomar wrote:
> > I also of course have ideas for generalizing the feature, so that you
> > can request any (sub)section by name, and, with a bit more ambition,[4]
> > paragraph tags (`TP`) too.
> >
> > So you could do things like:
> >
> > nroff -man -d EXTRACT="RETURN VALUE" man3/bsearch.3
>
> I certainly use this.
>
> # man_section() prints specific manual page sections (DESCRIPTION, SYNOPSIS,
> # ...) of all manual pages in a directory (or in a single manual page file).
> # Usage example: .../man-pages$ man_section man2 SYNOPSIS 'SEE ALSO';
>
> man_section()
> {
>     if [ $# -lt 2 ]; then
>         >&2 echo "Usage: ${FUNCNAME[0]} <dir> <section>...";
>         return $EX_USAGE;
>     fi
>
>     local page="$1";
>     shift;
>     local sect="$*";
>
>     find "$page" -type f \
>     |xargs wc -l \
>     |grep -v -e '\b1 ' -e '\btotal\b' \
>     |awk '{ print $2 }' \
>     |sort \
>     |while read -r manpage; do
>         (sed -n '/^\.TH/,/^\.SH/{/^\.SH/!p}' <"$manpage";
>         for s in $sect; do
>             <"$manpage" \
>             sed -n \
>                 -e "/^\.SH $s/p" \
>                 -e "/^\.SH $s/,/^\.SH/{/^\.SH/!p}";
>         done;) \
>         |mandoc -Tutf8 2>/dev/null \
>         |col -pbx;
>     done;
> }

On the other hand, you may want to just package this small shell script
(or rather a part of it) as a program. How about this?

$ cat /usr/local/bin/mansect
#!/bin/sh

if [ $# -lt 1 ]; then
    >&2 echo "Usage: $0 SECTION [FILE ...]";
    exit 1;
fi

s="$1";
shift;

if test -z "$*"; then
    sed -n \
        -e '/^\.TH/,/^\.SH/{/^\.SH/!p}' \
        -e '/^\.SH '"$s"'$/p' \
        -e '/^\.SH '"$s"'$/,/^\.SH/{/^\.SH/!p}' \
        ;
else
    find "$@" -not -type d \
    | xargs wc -l \
    | sed '${/ total$/d}' \
    | grep -v '\b1 ' \
    | awk '{ print $2 }' \
    | xargs -L1 sed -n \
        -e '/^\.TH/,/^\.SH/{/^\.SH/!p}' \
        -e '/^\.SH '"$s"'$/p' \
        -e '/^\.SH '"$s"'$/,/^\.SH/{/^\.SH/!p}' \
        ;
fi;

This only filters the source of the page, producing output that's
suitable for the groff pipeline.

alx@devuan:~$ man -w proc | xargs cat | mansect NAME .TH proc 5 2024-06-15 "Linux man-pages 6.9.1-158-g2ac94c631" .SH NAME proc \- process information, system information, and sysctl pseudo-filesystem alx@devuan:~$ man -w strtol strtoul | xargs mansect 'NAME' .TH strtol 3 2024-07-23 "Linux man-pages 6.9.1-158-g2ac94c631" .SH NAME strtol, strtoll, strtoq \- convert a string to a long integer .TH strtoul 3 2024-07-23 "Linux man-pages 6.9.1-158-g2ac94c631" .SH NAME strtoul, strtoull, strtouq \- convert a string to an unsigned long integer You can request several sections with a regex: $ man -w strtol strtoul | xargs mansect '\(NAME\|SEE ALSO\)' .TH strtol 3 2024-07-23 "Linux man-pages 6.9.1-158-g2ac94c631" .SH NAME strtol, strtoll, strtoq \- convert a string to a long integer .SH SEE ALSO .BR atof (3), .BR atoi (3), .BR atol (3), .BR strtod (3), .BR strtoimax (3), .BR strtoul (3) .TH strtoul 3 2024-07-23 "Linux man-pages 6.9.1-158-g2ac94c631" .SH NAME strtoul, strtoull, strtouq \- convert a string to an unsigned long integer .SH SEE ALSO .BR a64l (3), .BR atof (3), .BR atoi (3), .BR atol (3), .BR strtod (3), .BR strtol (3), .BR strtoumax (3) And it can then be piped to groff(1) to format the entire set of pages: $ man -w strtol strtoul | xargs mansect '\(NAME\|SEE ALSO\)' | groff -man -Tutf8 strtol(3) Library Functions Manual strtol(3) NAME strtol, strtoll, strtoq - convert a string to a long integer SEE ALSO atof(3), atoi(3), atol(3), strtod(3), strtoimax(3), strtoul(3) Linux man‐pages 6.9.1‐158‐g2ac... 2024‐07‐23 strtol(3) ─────────────────────────────────────────────────────────────────────────────── strtoul(3) Library Functions Manual strtoul(3) NAME strtoul, strtoull, strtouq - convert a string to an unsigned long integer SEE ALSO a64l(3), atof(3), atoi(3), atol(3), strtod(3), strtol(3), strtoumax(3) Linux man‐pages 6.9.1‐158‐g2ac... 
2024‐07‐23 strtoul(3) This is quite naive, and will not work with pages that define their own stuff, since this script is not groff(1). But it should be as fast as is possible, which is what Colin wants, is as simple as it can be (and thus relatively safe), and should work with most pages (as far as indexing is concerned, probably all?). Have a lovely night! Alex
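For readers who want to see what the three sed expressions in mansect do, they can be tried on a throwaway page. The sketch below is only an illustration and not part of the posted script: the page contents and the helper name extract_sect are made up here. The first expression prints everything from the .TH line up to (but excluding) the first .SH line, the second prints the requested section heading, and the third prints the section body up to (but excluding) the next .SH line.

```shell
#!/bin/sh
# Hypothetical demo page; its name and contents are invented for illustration.
page=$(mktemp)
cat >"$page" <<'EOF'
.TH demo 3 2024-11-02 "example"
.SH NAME
demo \- do nothing useful
.SH SYNOPSIS
.B demo
.SH SEE ALSO
.BR other (3)
EOF

# extract_sect SECTION FILE (hypothetical helper):
# print the .TH line, then the heading and body of SECTION.
extract_sect()
{
    sed -n \
        -e '/^\.TH/,/^\.SH/{/^\.SH/!p}' \
        -e '/^\.SH '"$1"'$/p' \
        -e '/^\.SH '"$1"'$/,/^\.SH/{/^\.SH/!p}' \
        "$2"
}

name_src=$(extract_sect NAME "$page")
printf '%s\n' "$name_src"
rm -f "$page"
```

The result is still man(7) source, so it can be piped to a formatter just like the full page.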
On Sat, Nov 02, 2024 at 10:36:20PM +0100, Alejandro Colomar wrote:
> This is quite naive, and will not work with pages that define their own
> stuff, since this script is not groff(1). But it should be as fast as
> is possible, which is what Colin wants, is as simple as it can be (and
> thus relatively safe), and should work with most pages (as far as
> indexing is concerned, probably all?).

I seem to be being invoked here for something I actually don't think I
want at all, which suggests that wires have been crossed somewhere. Can
you explain why I'd want to replace some part of a fairly well-optimized
and established C program with a shell pipeline? I'm pretty certain it
would not be faster, at least.

Thanks,

Hi Colin,

On Sat, Nov 02, 2024 at 11:47:14PM +0000, Colin Watson wrote:
> On Sat, Nov 02, 2024 at 10:36:20PM +0100, Alejandro Colomar wrote:
> > This is quite naive, and will not work with pages that define their own
> > stuff, since this script is not groff(1). But it should be as fast as
> > is possible, which is what Colin wants, is as simple as it can be (and
> > thus relatively safe), and should work with most pages (as far as
> > indexing is concerned, probably all?).
>
> I seem to be being invoked here for something I actually don't think I
> want at all, which suggests that wires have been crossed somewhere. Can
> you explain why I'd want to replace some part of a fairly well-optimized
> and established C program with a shell pipeline? I'm pretty certain it
> would not be faster, at least.

Are you sure? With a small tweak, I get the following comparison:

alx@devuan:~/src/linux/man-pages/man-pages/main$ time lexgrog man/*/* | wc
lexgrog: can't resolve man7/groff_man.7
12475 99295 919842

real 0m6.166s
user 0m5.132s
sys 0m1.336s
alx@devuan:~/src/linux/man-pages/man-pages/main$ time mansect NAME man/ \
| groff -man -Tutf8 | wc
9830 27109 689478

real 0m0.156s
user 0m0.219s
sys 0m0.019s

Yes, I'm working with uncompressed pages. We'd need to add support for
handling compressed pages. Also, we'd need to compare the performance
of lexgrog(1) with compressed pages. But for a starter, this suggests
some good performance.

(I say with a small tweak, because the version I've posted uses
xargs -L1, but I've tested for performance without the -L1, which is
the main bottleneck. It has no consequences for the NAME. I need to
work out some nasty details with sed -n1 for the generic version,
though.)

Have a lovely night!
Alex

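One possible way to cover the compressed-page case mentioned above is to decompress before filtering. This is a minimal sketch and rests on several assumptions: it handles gzip only (xz, bzip2, and zstd pages would need their own tools), it relies on gzip -cdf copying non-gzip input through unchanged, and the function name name_of and the demo pages are hypothetical. Nothing here has been benchmarked, so it says nothing about how the timings above would change.

```shell
#!/bin/sh
# Hypothetical demo pages: one plain, one gzip-compressed.
dir=$(mktemp -d)
printf '%s\n' '.TH plain 1 2024-11-02 "example"' '.SH NAME' \
    'plain \- an uncompressed page' '.SH DESCRIPTION' 'Text.' \
    >"$dir/plain.1"
printf '%s\n' '.TH packed 1 2024-11-02 "example"' '.SH NAME' \
    'packed \- a gzip-compressed page' '.SH DESCRIPTION' 'Text.' \
    >"$dir/packed.1"
gzip "$dir/packed.1"    # leaves only packed.1.gz

# name_of FILE (hypothetical helper): print the body of the NAME section.
# gzip -cdf decompresses gzip data and passes anything else through as-is.
name_of()
{
    gzip -cdf "$1" \
    | sed -n -e '/^\.SH NAME$/,/^\.SH/{/^\.SH/!p}'
}

plain_name=$(name_of "$dir/plain.1")
packed_name=$(name_of "$dir/packed.1.gz")
printf '%s\n' "$plain_name" "$packed_name"
rm -rf "$dir"
```

Spawning one gzip per page is itself a cost, so a fair comparison with lexgrog(1) or mandb(8) on compressed pages would still have to be measured.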
On Sun, Nov 03, 2024 at 01:05:42AM +0100, Alejandro Colomar wrote:
> Hi Colin,
>
> On Sat, Nov 02, 2024 at 11:47:14PM +0000, Colin Watson wrote:
> > On Sat, Nov 02, 2024 at 10:36:20PM +0100, Alejandro Colomar wrote:
> > > This is quite naive, and will not work with pages that define their own
> > > stuff, since this script is not groff(1). But it should be as fast as
> > > is possible, which is what Colin wants, is as simple as it can be (and
> > > thus relatively safe), and should work with most pages (as far as
> > > indexing is concerned, probably all?).
> >
> > I seem to be being invoked here for something I actually don't think I
> > want at all, which suggests that wires have been crossed somewhere. Can
> > you explain why I'd want to replace some part of a fairly well-optimized
> > and established C program with a shell pipeline? I'm pretty certain it
> > would not be faster, at least.
>
> Are you sure? With a small tweak, I get the following comparison:
>
> alx@devuan:~/src/linux/man-pages/man-pages/main$ time lexgrog man/*/* | wc
> lexgrog: can't resolve man7/groff_man.7
> 12475 99295 919842
>
> real 0m6.166s
> user 0m5.132s
> sys 0m1.336s
> alx@devuan:~/src/linux/man-pages/man-pages/main$ time mansect NAME man/ \
> | groff -man -Tutf8 | wc
> 9830 27109 689478
>
> real 0m0.156s
> user 0m0.219s
> sys 0m0.019s
>
> Yes, I'm working with uncompressed pages. We'd need to add support for
> handling compressed pages. Also, we'd need to compare the performance
> of lexgrog(1) with compressed pages. But for a starter, this suggests
> some good performance.
>
> (I say with a small tweak, because the version I've posted uses
> xargs -L1, but I've tested for performance without the -L1, which is
> the main bottleneck. It has no consequences for the NAME. I need to
> work out some nasty details with sed -n1 for the generic version,

s/n1/n/

> though.)
>
>
> Have a lovely night!
> Alex
>
> --
> <https://www.alejandro-colomar.es/>

On Sun, Nov 03, 2024 at 01:05:34AM +0100, Alejandro Colomar wrote:
> Are you sure? With a small tweak, I get the following comparison:
>
> alx@devuan:~/src/linux/man-pages/man-pages/main$ time lexgrog man/*/* | wc
> lexgrog: can't resolve man7/groff_man.7
> 12475 99295 919842

Comparing anything to lexgrog isn't very interesting; it's a debugging
tool and is not in itself very performance-sensitive. As I've explained
elsewhere, the interesting thing is mandb, which uses the same code
in-process to scan a whole tree of pages in one go. I do not expect to
ever want to replace that with a shell pipeline.

I'm not trying to stop you committing whatever you want to your repository, of course, but I want to be clear that this doesn't actually solve the right problem for manual page indexing. The point of the parsing code in mandb(8) - and I'm not claiming that it's great code or the perfect design, just that it works most of the time - is to extract the names and summary-descriptions from each page so that they can be used by tools such as apropos(1) and whatis(1). Splitting on section boundaries is just the simplest part of that problem, and I don't think that doing it in a separate program really gains anything. (That's leaving aside things like localized man pages, which I know some folks on the groff list tend to sniff at but I think they're important, and the fact that the NAME section has both semantic and presentational meaning means that like it or not the parser needs to be aware of this.)
Hi Colin, At 2024-11-02T19:06:53+0000, Colin Watson wrote: > How embarrassing. Could somebody please file a bug on > https://gitlab.com/man-db/man-db/-/issues to remind me to fix that? Done; <https://gitlab.com/man-db/man-db/-/issues/46>. > lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if > you focus on that then you'll end up with a design that's not very > useful. What really matters is indexing the whole system's manual > pages, and mandb(8) does not do that by invoking lexgrog(1) one page > at a time, but rather by running more or less the same code > in-process. Ah, I see it now--"lexgrog.l" is in both the Automake macros "lexgrog_SOURCES" and "mandb_SOURCES". Nice and DRY! > I already know that getting acceptable performance for > this requires care, as illustrated by one of the NEWS entries for > man-db 2.10.0: > > * Significantly improve `mandb(8)` and `man -K` performance in the > common case where pages are of moderate size and compressed using > `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test > system. > > ... so I'm prepared to bet that forking nroff one page at a time will > be unacceptably slow. Probably, but there is little reason to run nroff that way (as of groff 1.23). It already works well, but I have ideas for further hardening groff's man(7) and mdoc(7) packages such that they return to a well-defined state when changing input documents. > (This also combines with the fact that man-db applies some sandboxing > when it's calling nroff just in case it might happen that a > moderately-sized C++ project has less than 100% perfect security when > doing text processing, which I'm sure everyone agrees would never > happen.) Inconceivable, yes! But fortunately you can run nroff over N documents and pay its own startup overhead costs as well as those of sandboxing only once. > If it were possible to run nroff over a whole batch of pages and get > output for each of them in one go, then maaaaybe. 
That's already true for formatting the entire page. It's how this was created. https://www.gnu.org/software/groff/manual/groff-man-pages.utf8.txt (...best viewed with "less -R") With the `-d EXTRACT` feature I have in mind, in its as-simple-as-possible first-cut form, the problem you anticipate... > man-db would need a reliable way to associate each line (or sometimes > multiple lines) of output with each source file, ...would remain. I'll have to think of a good way to write out "metadata" (the input file name and the arguments to the `TH` request) as each page is encountered, and of an interface to enable that. I don't see it happening before groff 1.25. > and of course care would be needed around error handling and so on. I need to give this thought, too. What sorts of error scenarios do you foresee? GNU troff itself, if it can't open a file to be formatted, reports an error diagnostic and continues to the next `argv` string until it reaches the end of input. > I can see the appeal, in terms of processing the actual language > rather than a pile of hacks that try to guess what to do with it ...a major selling point, IMO... > but on the other hand this starts to feel like a much less natural fit > for the way nroff is run in every other situation, where you're > processing one document at a time. This I disagree with. Or perhaps more precisely, it's another example of the exception (man(1)) swallowing the rule (nroff/troff). nroff and troff were written as Unix filters; they read the standard input stream (and/or argument list)[1], do some processing, and write to standard output.[2] Historically, troff (or one of its preprocessors) was commonly used with multiple input files to catenate them. Here's an example of this practice from 1980. https://minnie.tuhs.org/cgi-bin/utree.pl?file=3BSD/usr/doc/pascal/makefile Regards, Branden [1] ...including this option from Seventh Edition Unix (1979) or earlier, which survives in GNU troff to this day. 
-i Read standard input after the input files are exhausted. [2] Seventh Edition troff didn't write to stdout by default, but tried to open the typesetter device. But it had an option to write to standard output. -t Direct output to the standard output instead of the phototypesetter. Running old school Unix under emulation these days, you _have_ to use this option to avoid the dreaded "Typesetter busy." diagnostic. When Kernighan refactored troff for device-independence, he reseated it more squarely in the Unix filter tradition by writing its plain-text page description language to stdout. The output driver, such as "dpost" for PostScript, also read its standard input, and could thus become just one more stage in a pipeline. [CSTR #97]
Hi Colin, At 2024-11-03T00:47:23+0000, Colin Watson wrote: > (That's leaving aside things like localized man pages, which I know > some folks on the groff list tend to sniff I can think of only one, the maintainer of a rival formatter. ;-) > at but I think they're important, Me too. I agree with the sniffer that no language is ever likely to reach 100% parity with English in something like the Debian distribution, but more modest domains exist. I've put effort into l10n issues in man(7) and in groff generally. In particular, I really want seamless multilingual document support and achievement of that goal will be, I think, much closer in groff 1.24. (My pending push is gated on deciding how to change the me(7) and ms(7) packages to accommodate a formatter-level fix to an ugly wart in the l10n department; see <https://savannah.gnu.org/bugs/?66387>.) > and the fact that the NAME section has both semantic and > presentational meaning means that like it or not the parser needs to > be aware of this.) Even if mandb(8) doesn't run groff to extract the summary descriptions/ apropos lines, I think this feature might be useful to you for coverage/regression testing. Presumably, for valid inputs, groff and mandb(8) should reach similar conclusions about how the text of a "Name" section is to be formatted. Regards, Branden
(now with some local vim macros fixed to stop accidentally corrupting the To: lines of some of my outgoing emails ...) On Sat, Nov 02, 2024 at 08:09:29PM -0500, G. Branden Robinson wrote: > At 2024-11-03T00:47:23+0000, Colin Watson wrote: > > and the fact that the NAME section has both semantic and > > presentational meaning means that like it or not the parser needs to > > be aware of this.) > > Even if mandb(8) doesn't run groff to extract the summary descriptions/ > apropos lines, I think this feature might be useful to you for > coverage/regression testing. Presumably, for valid inputs, groff and > mandb(8) should reach similar conclusions about how the text of a "Name" > section is to be formatted. Yes, that's a good point and I agree with that.
On Sat, Nov 02, 2024 at 07:50:23PM -0500, G. Branden Robinson wrote: > At 2024-11-02T19:06:53+0000, Colin Watson wrote: > > How embarrassing. Could somebody please file a bug on > > https://gitlab.com/man-db/man-db/-/issues to remind me to fix that? > > Done; <https://gitlab.com/man-db/man-db/-/issues/46>. Thanks, working on it. > > I already know that getting acceptable performance for > > this requires care, as illustrated by one of the NEWS entries for > > man-db 2.10.0: > > > > * Significantly improve `mandb(8)` and `man -K` performance in the > > common case where pages are of moderate size and compressed using > > `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test > > system. > > > > ... so I'm prepared to bet that forking nroff one page at a time will > > be unacceptably slow. > > Probably, but there is little reason to run nroff that way (as of groff > 1.23). It already works well, but I have ideas for further hardening > groff's man(7) and mdoc(7) packages such that they return to a > well-defined state when changing input documents. Being able to keep track of which output goes with which input pages is critical to the indexer, though (as you acknowledge later in your reply). It can't just throw the whole lot at nroff and call it a day. One other thing: mandb/lexgrog also looks for preprocessing filter hints in pages (`'\" te` and the like). This is obscure, to be sure, but either a replacement would need to do the same thing or we'd need to be certain that it's no longer required. > > and of course care would be needed around error handling and so on. > > I need to give this thought, too. What sorts of error scenarios do you > foresee? GNU troff itself, if it can't open a file to be formatted, > reports an error diagnostic and continues to the next `argv` string > until it reaches the end of input. That might be sufficient, or man-db might need to be able to detect which pages had errors. I'm not currently sure. 
> > but on the other hand this starts to feel like a much less natural fit > > for the way nroff is run in every other situation, where you're > > processing one document at a time. > > This I disagree with. Or perhaps more precisely, it's another example > of the exception (man(1)) swallowing the rule (nroff/troff). nroff and > troff were written as Unix filters; they read the standard input stream > (and/or argument list)[1], do some processing, and write to standard > output.[2] > > Historically, troff (or one of its preprocessors) was commonly used with > multiple input files to catenate them. But this application is not conceptually like catenation (even if it might be possible to implement it that way). The collection of all manual pages on a system is not like one long document that happens to be split over multiple files, certainly not from an indexer's point of view.
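The bookkeeping at issue here, keeping each extracted line tied to the page it came from while scanning many pages in a single process, can be illustrated with a small sketch. It is a toy under stated assumptions, not how mandb(8) works and not the proposed groff feature: it matches only a literal ".SH NAME" heading, ignores localized headings, macro definitions, and preprocessor hints, and the function name index_names and the demo pages are made up.

```shell
#!/bin/sh
# Hypothetical demo pages.  a.1 ends inside its NAME section, and b.1 starts
# with a preprocessor hint line, so per-file state must be reset between them.
dir=$(mktemp -d)
cat >"$dir/a.1" <<'EOF'
.TH a 1 2024-11-02 "example"
.SH NAME
a \- first demo page
EOF
cat >"$dir/b.1" <<'EOF'
'\" t
.TH b 1 2024-11-02 "example"
.SH NAME
b \- second demo page
.SH DESCRIPTION
Text.
EOF

# index_names FILE... (hypothetical helper): a single awk process reads every
# page and prefixes each NAME-section line with the file it came from.
index_names()
{
    awk '
        FNR == 1 { in_name = 0 }    # new input file: forget the old section
        /^\.SH / { in_name = ($0 == ".SH NAME"); next }
        in_name  { printf "%s\t%s\n", FILENAME, $0 }
    ' "$@"
}

index=$(index_names "$dir/a.1" "$dir/b.1")
printf '%s\n' "$index"
rm -rf "$dir"
```

Without the reset on the first line of each file, the hint line and the .TH line of b.1 would be reported as part of a NAME section. That is a small-scale version of the concern about returning to a well-defined state when the input document changes.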
Hi Alex, At 2024-11-02T11:39:37+0100, Alejandro Colomar wrote: > And diffs are a real win for text. Thus, semantic newlines are a real > win for text. "Write poems, not prose." (Any chance we may get that > warning added to groff(1)? :D) Yes, but I've kicked it out to groff 1.25 because a gift-wrapped opportunity came along. We get to retire a warning category and its number. groff(7) [1.23.0]: Warnings ... el 16 The el request was encountered with no prior corresponding ie request. groff 1.24.0 [in preparation] NEWS: * The "el" warning category has been withdrawn. If enabled (which it was not by default), the formatter would emit a diagnostic if it inferred an imbalance between `ie` and `el` requests. Unfortunately its technique wasn't reliable and sometimes spuriously issued these warnings, and making it perfectly reliable did not look tractable. We recommend using brace escape sequences `\{` and `\}` to ensure that your control flow structures remain maintainable. This was a 35-year-old bug (or incomplete feature) in GNU troff that as far as I know first came to attention 10 years ago when the then-Heirloom Doctools maintainer pointed out an incompatibility between AT&T troff (from which Heirloom Doctools descends) and GNU troff. https://savannah.gnu.org/bugs/?45502 More recently, Paul Eggert scored big-time grognard points by actually depending on the AT&T troff behavior in the zic(8) man page. https://savannah.gnu.org/bugs/?65474 We therefore _had_ to fix it. The consequence is that the warning category `el` and bit 4 in the warning mask integer are undefined for groff 1.24. This was irresistible serendipity, because this warning category was (1) not enabled by default and (2) probably used only by people who wouldn't object to style warnings anyway. In groff 1.25, I want to revive bit 4 as new warning category `style`. Ending sentences before the end of a text line is something we can warn about as discussed a while back, and I plan to do so. 
https://lists.gnu.org/archive/html/groff/2022-06/msg00052.html I've been collecting specimens of other contemplated style warnings. https://savannah.gnu.org/bugs/?62776 Regards, Branden
diff --git a/man/man5/proc_pid_fdinfo.5 b/man/man5/proc_pid_fdinfo.5
index 1e23bbe02..8678caf4a 100644
--- a/man/man5/proc_pid_fdinfo.5
+++ b/man/man5/proc_pid_fdinfo.5
@@ -6,20 +6,19 @@
 .\"
 .TH proc_pid_fdinfo 5 (date) "Linux man-pages (unreleased)"
 .SH NAME
-/proc/pid/fdinfo/ \- information about file descriptors
+.IR /proc/ pid /fdinfo " \- information about file descriptors"
 .SH DESCRIPTION
-.TP
-.IR /proc/ pid /fdinfo/ " (since Linux 2.6.22)"
-This is a subdirectory containing one entry for each file which the
-process has open, named by its file descriptor.
-The files in this directory are readable only by the owner of the process.
-The contents of each file can be read to obtain information
-about the corresponding file descriptor.
-The content depends on the type of file referred to by the
-corresponding file descriptor.
-.IP
+Since Linux 2.6.22,
+this subdirectory contains one entry for each file that process
+.I pid
+has open, named by its file descriptor. The files in this directory
+are readable only by the owner of the process. The contents of each
+file can be read to obtain information about the corresponding file
+descriptor. The content depends on the type of file referred to by
+the corresponding file descriptor.
+.P
 For regular files and directories, we see something like:
-.IP
+.P
 .in +4n
 .EX
 .RB "$" " cat /proc/12015/fdinfo/4"
@@ -28,7 +27,7 @@ flags: 01002002
 mnt_id: 21
 .EE
 .in
-.IP
+.P
 The fields are as follows:
 .RS
 .TP
@@ -51,7 +50,6 @@ this field incorrectly displayed the setting of
 at the time the file was opened,
 rather than the current setting of the close-on-exec flag.
 .TP
-.I
 .I mnt_id
 This field, present since Linux 3.15,
 .\" commit 49d063cb353265c3af701bab215ac438ca7df36d
@@ -59,13 +57,13 @@ is the ID of the mount containing this file.
 See the description of
 .IR /proc/ pid /mountinfo .
 .RE
-.IP
+.P
 For eventfd file descriptors (see
 .BR eventfd (2)),
 we see (since Linux 3.8)
 .\" commit cbac5542d48127b546a23d816380a7926eee1c25
 the following fields:
-.IP
+.P
 .in +4n
 .EX
 pos: 0
@@ -74,16 +72,16 @@ mnt_id: 10
 eventfd\-count: 40
 .EE
 .in
-.IP
+.P
 .I eventfd\-count
 is the current value of the eventfd counter, in hexadecimal.
-.IP
+.P
 For epoll file descriptors (see
 .BR epoll (7)),
 we see (since Linux 3.8)
 .\" commit 138d22b58696c506799f8de759804083ff9effae
 the following fields:
-.IP
+.P
 .in +4n
 .EX
 pos: 0
@@ -93,7 +91,7 @@ tfd: 9 events: 19 data: 74253d2500000009
 tfd: 7 events: 19 data: 74253d2500000007
 .EE
 .in
-.IP
+.P
 Each of the lines beginning
 .I tfd
 describes one of the file descriptors being monitored via
@@ -110,13 +108,13 @@ descriptor.
 The
 .I data
 field is the data value associated with this file descriptor.
-.IP
+.P
 For signalfd file descriptors (see
 .BR signalfd (2)),
 we see (since Linux 3.8)
 .\" commit 138d22b58696c506799f8de759804083ff9effae
 the following fields:
-.IP
+.P
 .in +4n
 .EX
 pos: 0
@@ -125,7 +123,7 @@ mnt_id: 10
 sigmask: 0000000000000006
 .EE
 .in
-.IP
+.P
 .I sigmask
 is the hexadecimal mask of signals that are accepted via this
 signalfd file descriptor.
@@ -135,12 +133,12 @@ and
 .BR SIGQUIT ;
 see
 .BR signal (7).)
-.IP
+.P
 For inotify file descriptors (see
 .BR inotify (7)),
 we see (since Linux 3.8)
 the following fields:
-.IP
+.P
 .in +4n
 .EX
 pos: 0
@@ -150,7 +148,7 @@ inotify wd:2 ino:7ef82a sdev:800001 mask:800afff ignored_mask:0 fhandle\-bytes:8
 inotify wd:1 ino:192627 sdev:800001 mask:800afff ignored_mask:0 fhandle\-bytes:8 fhandle\-type:1 f_handle:27261900802dfd73
 .EE
 .in
-.IP
+.P
 Each of the lines beginning with "inotify" displays information about
 one file or directory that is being monitored.
 The fields in this line are as follows:
@@ -168,19 +166,19 @@ The ID of the device where the target file resides (in hexadecimal).
 .I mask
 The mask of events being monitored for the target file (in hexadecimal).
 .RE
-.IP
+.P
 If the kernel was built with exportfs support, the path to the target
 file is exposed as a file handle, via three hexadecimal fields:
 .IR fhandle\-bytes ,
 .IR fhandle\-type ,
 and
 .IR f_handle .
-.IP
+.P
 For fanotify file descriptors (see
 .BR fanotify (7)),
 we see (since Linux 3.8)
 the following fields:
-.IP
+.P
 .in +4n
 .EX
 pos: 0
@@ -190,7 +188,7 @@ fanotify flags:0 event\-flags:88002
 fanotify ino:19264f sdev:800001 mflags:0 mask:1 ignored_mask:0 fhandle\-bytes:8 fhandle\-type:1 f_handle:4f261900a82dfd73
 .EE
 .in
-.IP
+.P
 The fourth line displays information defined when the fanotify group
 was created via
 .BR fanotify_init (2):
@@ -210,7 +208,7 @@ argument given to
 .BR fanotify_init (2)
 (expressed in hexadecimal).
 .RE
-.IP
+.P
 Each additional line shown in the file contains information
 about one of the marks in the fanotify group.
 Most of these fields are as for inotify, except:
@@ -228,16 +226,16 @@ The events mask for this mark
 The mask of events that are ignored for this mark
 (expressed in hexadecimal).
 .RE
-.IP
+.P
 For details on these fields, see
 .BR fanotify_mark (2).
-.IP
+.P
 For timerfd file descriptors (see
 .BR timerfd (2)),
 we see (since Linux 3.17)
 .\" commit af9c4957cf212ad9cf0bee34c95cb11de5426e85
 the following fields:
-.IP
+.P
 .in +4n
 .EX
 pos: 0
When /proc/pid/fdinfo was part of the proc.5 man page, the indentation made
sense. As a standalone man page, the indentation doesn't need to be so
far over to the right. Remove the initial tagged paragraph and move the
styling to the initial summary description.

Suggested-by: G. Branden Robinson <g.branden.robinson@gmail.com>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 man/man5/proc_pid_fdinfo.5 | 66 ++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 34 deletions(-)