From patchwork Sun Dec 3 21:30:33 2017
X-Patchwork-Submitter: Timofey Titovets <nefelim4ag@gmail.com>
X-Patchwork-Id: 10089423
From: Timofey Titovets <nefelim4ag@gmail.com>
To: linux-btrfs@vger.kernel.org
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Subject: [PATCH v2] Btrfs: heuristic replace heap sort with radix sort
Date: Mon, 4 Dec 2017 00:30:33 +0300
Message-Id: <20171203213033.28258-1-nefelim4ag@gmail.com>
X-Mailer: git-send-email 2.15.1

The slowest part of the heuristic is currently the kernel heap sort():
it can take up to 55% of the runtime just sorting bucket items.

As sorting is called on most data sets to compute byte_core_set_size
correctly, the only way to speed up the heuristic is to speed up the
sort on the bucket.

Add a general radix_sort() function. Radix sort requires two buffers,
one the full size of the input array and one to store counters (jump
addresses). That increases memory usage per heuristic workspace by 1KiB:
8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort. I use 4 bits as the base of the calculation
to keep the counters array acceptably small (16 elements * 8 bytes).

The implementation has several points that can be adjusted; I wrote it
generically so that, like heap sort, it can be reused elsewhere in the
kernel if needed.

Performance was tested on a userspace copy of the heuristic code,
throughput:
 - average <-> random data: ~3500 MiB/s - heap sort
 - average <-> random data: ~6000 MiB/s - radix sort

Changes:
  v1 -> v2:
    - Tested on Big Endian
    - Dropped most of the multiply operations
    - Allocate the sort buffer separately

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
---
 fs/btrfs/compression.c | 147 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 140 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ae016699d13e..19b52982deda 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -33,7 +33,6 @@
 #include <linux/bit_spinlock.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
-#include <linux/sort.h>
 #include <linux/log2.h>
 #include "ctree.h"
 #include "disk-io.h"
@@ -752,6 +751,8 @@ struct heuristic_ws {
 	u32 sample_size;
 	/* Buckets store counters for each byte value */
 	struct bucket_item *bucket;
+	/* Sorting buffer */
+	struct bucket_item *bucket_b;
 	struct list_head list;
 };
 
@@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws)
 
 	kvfree(workspace->sample);
 	kfree(workspace->bucket);
+	kfree(workspace->bucket_b);
 	kfree(workspace);
 }
 
@@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void)
 	if (!ws->bucket)
 		goto fail;
 
+	ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL);
+	if (!ws->bucket_b)
+		goto fail;
+
 	INIT_LIST_HEAD(&ws->list);
 	return &ws->list;
 fail:
@@ -1278,13 +1284,136 @@ static u32 shannon_entropy(struct heuristic_ws *ws)
 	return entropy_sum * 100 / entropy_max;
 }
 
-/* Compare buckets by size, ascending */
-static int bucket_comp_rev(const void *lv, const void *rv)
+#define RADIX_BASE 4
+#define COUNTERS_SIZE (1 << RADIX_BASE)
+
+static inline uint8_t get4bits(uint64_t num, int shift) {
+	uint8_t low4bits;
+
+	num = num >> shift;
+	/* Reverse order */
+	low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
+	return low4bits;
+}
+
+static inline void copy_cell(void *dst, int dest_i, void *src, int src_i)
 {
-	const struct bucket_item *l = (const struct bucket_item *)lv;
-	const struct bucket_item *r = (const struct bucket_item *)rv;
+	struct bucket_item *dstv = (struct bucket_item *)dst;
+	struct bucket_item *srcv = (struct bucket_item *)src;
+	dstv[dest_i] = srcv[src_i];
+}
 
-	return r->count - l->count;
+static inline uint64_t get_num(const void *a, int i)
+{
+	struct bucket_item *av = (struct bucket_item *)a;
+	return av[i].count;
 }
+
+/*
+ * Use 4 bits as radix base
+ * Use 16 uint64_t counters for calculating new position in buf array
+ *
+ * @array     - array that will be sorted
+ * @array_buf - buffer array to store sorting results
+ *              must be equal in size to @array
+ * @num       - array size
+ * @max_cell  - link to the element with the maximum possible value,
+ *              can be used to cap radix sort iterations if we know
+ *              the maximum value before calling the sort
+ * @get_num   - function to extract number from array
+ * @copy_cell - function to copy data from array to array_buf and
+ *              vice versa
+ * @get4bits  - function to get 4 bits from number at specified offset
+ */
+static void radix_sort(void *array, void *array_buf,
+		       int num,
+		       const void *max_cell,
+		       uint64_t (*get_num)(const void *, int i),
+		       void (*copy_cell)(void *dest, int dest_i,
+					 void *src, int src_i),
+		       uint8_t (*get4bits)(uint64_t num, int shift))
+{
+	u64 max_num;
+	uint64_t buf_num;
+	uint64_t counters[COUNTERS_SIZE];
+	uint64_t new_addr;
+	int i;
+	int addr;
+	int bitlen;
+	int shift;
+
+	/*
+	 * Try to avoid useless loop iterations for small numbers stored
+	 * in big counters, example: 48 33 4 ... in a 64bit array
+	 */
+	if (!max_cell) {
+		max_num = get_num(array, 0);
+		for (i = 1; i < num; i++) {
+			buf_num = get_num(array, i);
+			if (buf_num > max_num)
+				max_num = buf_num;
+		}
+	} else {
+		max_num = get_num(max_cell, 0);
+	}
+
+	buf_num = ilog2(max_num);
+	bitlen = ALIGN(buf_num, RADIX_BASE * 2);
+
+	shift = 0;
+	while (shift < bitlen) {
+		memset(counters, 0, sizeof(counters));
+
+		for (i = 0; i < num; i++) {
+			buf_num = get_num(array, i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]++;
+		}
+
+		for (i = 1; i < COUNTERS_SIZE; i++)
+			counters[i] += counters[i - 1];
+
+		for (i = num - 1; i >= 0; i--) {
+			buf_num = get_num(array, i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]--;
+			new_addr = counters[addr];
+			copy_cell(array_buf, new_addr, array, i);
+		}
+
+		shift += RADIX_BASE;
+
+		/*
+		 * A normal radix pass is expected to move the data from
+		 * the tmp array back to the main one, but that costs some
+		 * CPU time. Avoid it by doing another sort iteration from
+		 * the buffer back into the original array instead of a
+		 * memcpy().
+		 */
+		memset(counters, 0, sizeof(counters));
+
+		for (i = 0; i < num; i++) {
+			buf_num = get_num(array_buf, i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]++;
+		}
+
+		for (i = 1; i < COUNTERS_SIZE; i++)
+			counters[i] += counters[i - 1];
+
+		for (i = num - 1; i >= 0; i--) {
+			buf_num = get_num(array_buf, i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]--;
+			new_addr = counters[addr];
+			copy_cell(array, new_addr, array_buf, i);
+		}
+
+		shift += RADIX_BASE;
+	}
+}
 
 /*
@@ -1312,9 +1441,13 @@ static int byte_core_set_size(struct heuristic_ws *ws)
 	u32 coreset_sum = 0;
 	const u32 core_set_threshold = ws->sample_size * 90 / 100;
 	struct bucket_item *bucket = ws->bucket;
+	struct bucket_item max_cell;
 
 	/* Sort in reverse order */
-	sort(bucket, BUCKET_SIZE, sizeof(*bucket), &bucket_comp_rev, NULL);
+	max_cell.count = MAX_SAMPLE_SIZE;
+	radix_sort(ws->bucket, ws->bucket_b,
+		   BUCKET_SIZE, &max_cell,
+		   get_num, copy_cell, get4bits);
 
 	for (i = 0; i < BYTE_CORE_SET_LOW; i++)
 		coreset_sum += bucket[i].count;
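
For anyone who wants to experiment with the algorithm outside the
kernel, below is a minimal standalone userspace sketch of the same
descending 4-bit LSD radix sort. This is an illustration, not part of
the patch: struct bucket_item and the get4bits() nibble reversal mirror
the code above, but the buffer handling here swaps pointers each pass
and copies back once at the end, instead of the patch's fixed
two-passes-per-loop trick.

/* radix_demo.c - userspace sketch of the descending LSD radix sort */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define RADIX_BASE	4
#define COUNTERS_SIZE	(1 << RADIX_BASE)

struct bucket_item {
	uint32_t count;
};

/* Reversing the nibble order turns counting sort into a descending sort */
static uint8_t get4bits(uint64_t num, int shift)
{
	return (COUNTERS_SIZE - 1) - ((num >> shift) % COUNTERS_SIZE);
}

static void radix_sort(struct bucket_item *array, struct bucket_item *buf,
		       int num, uint64_t max_num)
{
	uint64_t counters[COUNTERS_SIZE];
	struct bucket_item *from = array, *to = buf, *tmp;
	int bitlen = 0, shift, i;

	/* Enough 4-bit digits to cover the largest value */
	while ((1ULL << bitlen) <= max_num)
		bitlen += RADIX_BASE;

	for (shift = 0; shift < bitlen; shift += RADIX_BASE) {
		memset(counters, 0, sizeof(counters));
		/* Count how often each (reversed) digit occurs */
		for (i = 0; i < num; i++)
			counters[get4bits(from[i].count, shift)]++;
		/* Prefix sums give each digit its end offset in 'to' */
		for (i = 1; i < COUNTERS_SIZE; i++)
			counters[i] += counters[i - 1];
		/* Stable placement: walk backwards over the input */
		for (i = num - 1; i >= 0; i--)
			to[--counters[get4bits(from[i].count, shift)]] = from[i];
		/* Swap the roles of the arrays instead of copying back */
		tmp = from;
		from = to;
		to = tmp;
	}
	/* After an odd number of passes the result sits in the buffer */
	if (from != array)
		memcpy(array, from, num * sizeof(*array));
}

int main(void)
{
	struct bucket_item a[] = { {48}, {3}, {255}, {3}, {1024}, {0} };
	struct bucket_item buf[6];
	int i;

	radix_sort(a, buf, 6, 1024);
	for (i = 0; i < 6; i++)
		printf("%u ", a[i].count);	/* prints: 1024 255 48 3 3 0 */
	printf("\n");
	return 0;
}

Each pass is a stable counting sort on one 4-bit digit, so items that
are equal in the current digit keep their relative order, and after the
last pass the array is fully ordered. The subtraction in get4bits() is
what makes the final order descending rather than ascending.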