From patchwork Sat Aug 18 20:47:23 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Rasmus Villemoes <linux@rasmusvillemoes.dk>
X-Patchwork-Id: 10569651
From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
To: Masahiro Yamada <yamada.masahiro@socionext.com>,
        Michal Marek <michal.lkml@markovi.net>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>,
        Ingo Molnar <mingo@kernel.org>, linux-kernel@vger.kernel.org,
        linux-kbuild@vger.kernel.org
Subject: [RFC PATCH] scripts: add header bloat measuring script
Date: Sat, 18 Aug 2018 22:47:23 +0200
Message-Id: <20180818204723.11060-1-linux@rasmusvillemoes.dk>
X-Mailer: git-send-email 2.16.4
In-Reply-To: 0180226075931.5vn4vdbfcsje2z56@gmail.com
Sender: linux-kbuild-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kbuild.vger.kernel.org>
X-Mailing-List: linux-kbuild@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

With a little cooperation from fixdep, we can rather easily quantify the
header bloat phenomenon.

While computing CONFIG_ dependencies, fixdep opens all the headers used
by a given translation unit anyway, so it's rather cheap to have it
record the number and total size of those in the generated .o.cmd file.

Those lines can then be post-processed and summarized by the new
header-bloat-stat.pl script. For example, backporting this to the v4.17
and v4.18 releases shows that for a defconfig x86_64 kernel, the median
"bloat factor", (total size of translation unit)/(size of .c file),
increased from 237.7 to 239.8, and the average total translation unit
size grew by 2.5% while the average .c file only grew by 0.4%. While
these numbers by themselves are not particularly alarming, they
accumulate over several releases and builds do get noticeably slower:
back at v3.0, the median bloat factor was 177.8.

Having infrastructure like this makes it easier to measure the effect
should anyone attempt something similar to the sched.h cleanup, or just
go over a subsystem trimming unused #includes from .c files (if the
script is passed one or more directories it only processes those).

On a positive note, maybe 4.19 will be a rare exception; as of
1f7a4c73a739, the median bloat factor is down to 236.0, the average .c
file has increased by 0.4% but the average total translation unit is
nevertheless 1.2% smaller, compared to v4.18.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
For some statistics that also include build times, covering releases
v3.0 through v4.15, see https://wildmoose.dk/header-bloat/ . I'm not
sure that page will remain forever, so I'm not including the URL in the
commit log.

I can understand if people feel this is of too little utility to
justify hooking into fixdep like this. It's entirely possible to
compute the same statistics with external tools that just parse the
.o.cmd files themselves.
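
For reference, here is a minimal sketch in C of how a consumer might
read the "# header-stats: <nheaders> <c_size> <h_size>" line that the
patched fixdep appends to each .o.cmd file, and derive the per-TU bloat
factor the same way header-bloat-stat.pl does. The struct and function
names are hypothetical and not part of this patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one "# header-stats:" line. */
struct header_stats {
	unsigned nheaders;	/* number of headers pulled in */
	unsigned c_size;	/* size of the .c (or .S) file, in bytes */
	unsigned h_size;	/* total size of all headers, in bytes */
};

/* Returns 1 if 'line' is a header-stats line and was parsed, else 0. */
static int parse_header_stats(const char *line, struct header_stats *out)
{
	assert(line && out);
	return sscanf(line, "# header-stats: %u %u %u",
		      &out->nheaders, &out->c_size, &out->h_size) == 3;
}

/*
 * Bloat factor: (total size of translation unit)/(size of .c file).
 * Like the Perl script, fall back to 1.0 for an empty source file.
 */
static double bloat_factor(const struct header_stats *s)
{
	if (!s->c_size)
		return 1.0;
	return ((double)s->c_size + s->h_size) / s->c_size;
}
```

Feeding it the line "# header-stats: 3 1000 236000" gives a bloat
factor of (1000 + 236000)/1000 = 237.0. Any line that does not match
the format is rejected.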

 scripts/basic/fixdep.c       | 18 +++++++--
 scripts/header-bloat-stat.pl | 95 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+), 4 deletions(-)
 create mode 100755 scripts/header-bloat-stat.pl

diff --git a/scripts/basic/fixdep.c b/scripts/basic/fixdep.c
index 850966f3d602..f1dec85cf9d9 100644
--- a/scripts/basic/fixdep.c
+++ b/scripts/basic/fixdep.c
@@ -248,7 +248,7 @@ static void parse_config_file(const char *p)
 	}
 }
 
-static void *read_file(const char *filename)
+static void *read_file(const char *filename, unsigned *size)
 {
 	struct stat st;
 	int fd;
@@ -276,6 +276,8 @@ static void *read_file(const char *filename)
 	}
 	buf[st.st_size] = '\0';
 	close(fd);
+	if (size)
+		*size += st.st_size;
 
 	return buf;
 }
@@ -300,6 +302,8 @@ static void parse_dep_file(char *m, const char *target, int insert_extra_deps)
 	int saw_any_target = 0;
 	int is_first_dep = 0;
 	void *buf;
+	unsigned nheaders = 0, c_size = 0, h_size = 0;
+	unsigned *sizevar;
 
 	while (1) {
 		/* Skip any "white space" */
@@ -321,6 +325,8 @@ static void parse_dep_file(char *m, const char *target, int insert_extra_deps)
 			/* The /next/ file is the first dependency */
 			is_first_dep = 1;
 		} else if (!is_ignored_file(m, p - m)) {
+			sizevar = NULL;
+
 			*p = '\0';
 
 			/*
@@ -343,13 +349,16 @@ static void parse_dep_file(char *m, const char *target, int insert_extra_deps)
 					printf("source_%s := %s\n\n",
 					       target, m);
 					printf("deps_%s := \\\n", target);
+					sizevar = &c_size;
 				}
 				is_first_dep = 0;
 			} else {
 				printf("  %s \\\n", m);
+				sizevar = &h_size;
+				nheaders++;
 			}
 
-			buf = read_file(m);
+			buf = read_file(m, sizevar);
 			parse_config_file(buf);
 			free(buf);
 		}
@@ -373,7 +382,8 @@ static void parse_dep_file(char *m, const char *target, int insert_extra_deps)
 		do_extra_deps();
 
 	printf("\n%s: $(deps_%s)\n\n", target, target);
-	printf("$(deps_%s):\n", target);
+	printf("$(deps_%s):\n\n", target);
+	printf("# header-stats: %u %u %u\n", nheaders, c_size, h_size);
 }
 
 int main(int argc, char *argv[])
@@ -394,7 +404,7 @@ int main(int argc, char *argv[])
 
 	printf("cmd_%s := %s\n\n", target, cmdline);
 
-	buf = read_file(depfile);
+	buf = read_file(depfile, NULL);
 	parse_dep_file(buf, target, insert_extra_deps);
 	free(buf);
 
diff --git a/scripts/header-bloat-stat.pl b/scripts/header-bloat-stat.pl
new file mode 100755
index 000000000000..528021907df1
--- /dev/null
+++ b/scripts/header-bloat-stat.pl
@@ -0,0 +1,95 @@
+#!/usr/bin/perl
+
+use strict;
+use warnings;
+
+use Getopt::Long;
+use File::Find;
+use Statistics::Descriptive;
+
+sub help {
+    printf "%s [-c] [-m] [-n <name>] [<dirs>]\n", $0;
+    printf "  -c  output a single line with data in columns\n";
+    printf "  -m  include min/max statistics\n";
+    printf "  -n  optional name (e.g. git revision) to use as first datum\n";
+    exit(0);
+}
+
+my $name;
+my $minmax = 0;
+my $column = 0;
+
+GetOptions("c|column" => \$column,
+	   "m|minmax" => \$minmax,
+	   "n|name=s" => \$name,
+	   "h|help"   => \&help)
+    or die "Bad option";
+
+my @stats =
+    (
+     ['mean',   sub {$_[0]->mean()}],
+     ['min',    sub {$_[0]->min()}],
+     ['q25',    sub {$_[0]->quantile(1)}],
+     ['median', sub {$_[0]->quantile(2)}],
+     ['q75',    sub {$_[0]->quantile(3)}],
+     ['max',    sub {$_[0]->max()}],
+    );
+
+my @scalars = ('hcount', 'csize', 'tsize', 'ratio');
+my %data;
+my @out;
+
+find({wanted => \&process_cmd_file, no_chdir => 1}, @ARGV ? @ARGV : '.');
+
+add_output('name', $name) if $name;
+add_output('#TUs', $data{ntu});
+for my $s (@scalars) {
+    my $vals = Statistics::Descriptive::Full->new();
+    $vals->add_data(@{$data{$s}});
+    $vals->sort_data();
+    for my $stat (@stats) {
+	next if $s eq 'ratio' && $stat->[0] eq 'mean';
+	next if $stat->[0] =~ m/^(min|max)$/ && !$minmax;
+	my $val = $stat->[1]->($vals);
+	add_output($s . "_" . $stat->[0], $val);
+    }
+}
+
+if ($column) {
+    print join("\t", map {$_->[1]} @out), "\n";
+} else {
+    printf "%s\t%s\n", @$_ for @out;
+}
+
+sub add_output {
+    push @out, [@_];
+}
+
+sub process_cmd_file {
+    # Remove leading ./ components
+    s|^(\./)*||;
+    # Stuff that includes userspace/host headers is not interesting.
+    if (m/^(scripts|tools)/) {
+	$File::Find::prune = 1;
+	return;
+    }
+    return unless m/\.o\.cmd$/;
+
+    open(my $fh, '<', $_)
+	or die "failed to open $_: $!";
+    while (<$fh>) {
+	chomp;
+	if (m/^source_/) {
+	    # Only process stuff built from .S or .c
+	    return unless m/\.[Sc]$/;
+	}
+	if (m/^# header-stats: ([0-9]+) ([0-9]+) ([0-9]+)/) {
+	    push @{$data{hcount}}, $1;
+	    push @{$data{csize}}, $2;
+	    push @{$data{tsize}}, $2 + $3;
+	    push @{$data{ratio}}, $2 ? ($2 + $3)/$2 : 1.0;
+	    $data{ntu}++;
+	}
+    }
+    close($fh);
+}