From patchwork Sun Jan 24 02:12:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12041983 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 155F3C433E0 for ; Sun, 24 Jan 2021 02:13:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D8C7022B51 for ; Sun, 24 Jan 2021 02:13:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726460AbhAXCNt (ORCPT ); Sat, 23 Jan 2021 21:13:49 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59880 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726375AbhAXCNe (ORCPT ); Sat, 23 Jan 2021 21:13:34 -0500 Received: from mail-wr1-x42a.google.com (mail-wr1-x42a.google.com [IPv6:2a00:1450:4864:20::42a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7B32EC061786 for ; Sat, 23 Jan 2021 18:12:53 -0800 (PST) Received: by mail-wr1-x42a.google.com with SMTP id c12so8781856wrc.7 for ; Sat, 23 Jan 2021 18:12:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=NB7mhvFvUjEowPusLa+YA1W3DcEB1yfVBnq/lJzfrok=; b=kuJOpcN7oAF4urwSPspZ1XOh3SxMUDkmM3iQdk9jyu7L0p5KvEoWq4n6DLr4Cy3BwR 
OAyGSYp0YbXbLpVHYXfFs30HXjo1Fr7VY2QcX/VzaZYeMvK5UM5/whNPJjLyMlsXWoLu XOMMQ+S5Rc+1+ydr2hYMLJri+lOUGx260h0Nq83OQfuC5wQyVhuxMaobRtRmaqkp0nS0 mwuFqrF7YzI1n7bCgWKyWGy/8t8FTg4IlIHiiy9/sOSAoTiY1ok5HYv/t7DsXqbLu2KQ H7OUyKu/Bk8dYFJI8VbhsYCJTOVQZsNTRiet1FeAORv4a0XUQQASyR1RxGHcU9u6DbN7 chXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=NB7mhvFvUjEowPusLa+YA1W3DcEB1yfVBnq/lJzfrok=; b=NoC/aPWgU4Z9r6cLkKbca3/CLfnfWbltruTH93phnT8sUD1S5DMeoWXDEzOemPwNM5 ivJjAo/7V4ZUg9IoVsZsJ8XtFS3WrkdtfTz20L7Zdy/tfp+nz+LSZrcv6xwVvlitwRlv rO+cWF7BCWJ7OLcGs8O0ljIGbF9juraec03i2KleLM7dnTYl+uRPbLTYIxBKD8Zbzvnx moqxZCZlio6S5p5UcD2Yj+Mh32hHY8ScmIuueXP3ONHnCHkEz75JoZ4MMHtJNI6m8RpM aS13u/uQtQPSG5WkUP43j/5ZvjzkvOafm3607SfSxyxSdOSJpC71IQflPECQoXkNfhb7 cu1A== X-Gm-Message-State: AOAM530/TaJv2F8mOH6OVVMVaFAep5dQyl0UjTLTJaAGYHmpBC+fbqOH SLt7p+uFkit+ExpvYN6uvvXdmjA0Mn9nUg== X-Google-Smtp-Source: ABdhPJzbFqQgTp9f2jjAPlJBfk0AwtPICvGgULJqOFe6EKVKjUQ5Uxw5GFsndPVcspZmVHjHTCrpIA== X-Received: by 2002:adf:f9d0:: with SMTP id w16mr11004913wrr.137.1611454371935; Sat, 23 Jan 2021 18:12:51 -0800 (PST) Received: from vm.nix.is (vm.nix.is. 
[2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id z184sm17380129wmg.7.2021.01.23.18.12.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 23 Jan 2021 18:12:51 -0800 (PST) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Johannes Schindelin , Todd Zullinger , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFz?= =?utf-8?b?b24=?= Subject: [PATCH v3 1/4] grep/pcre2 tests: don't rely on invalid UTF-8 data test Date: Sun, 24 Jan 2021 03:12:26 +0100 Message-Id: <20210124021229.25987-2-avarab@gmail.com> X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8 In-Reply-To: <20190726150818.6373-9-avarab@gmail.com> References: <20190726150818.6373-9-avarab@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org As noted in [1] when I originally added this test in [2] the test was completely broken as it lacked a redirect[3]. I now think this whole thing is overly fragile. Let's only test if we have a segfault here. Before this the first test's "test_cmp" was pretty meaningless. We were only testing if PCREv2 was so broken that it would spew out something completely unrelated on stdout, which isn't very plausible. In the second test we're relying on PCREv2 forever holding to the current behavior of the PCRE_UTF8 flag, as opposed to learning some optimistic graceful fallback to PCRE2_MATCH_INVALID_UTF in the future. If that happens having this test broken under bisecting would suck. A follow-up commit will actually test this case in a meaningful way under the PCRE2_MATCH_INVALID_UTF flag. Let's run this one unconditionally, and just make sure we don't segfault. 1. e714b898c6 (t7812: expect failure for grep -i with invalid UTF-8 data, 2019-11-29) 2. 8a5999838e (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26) 3. 
c74b3cbb83 (t7812: add missing redirects, 2019-11-26) Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t7812-grep-icase-non-ascii.sh | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index 03dba6685a..38457c2e4f 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -76,12 +76,7 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invali test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' ' test_might_fail git grep -hi "Æ" invalid-0x80 >actual && - if test -s actual - then - test_cmp expected actual - fi && - test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual && - ! test_cmp expected actual + test_might_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual ' test_done From patchwork Sun Jan 24 02:12:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12041989 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E6F18C433E0 for ; Sun, 24 Jan 2021 02:14:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BB99622B51 for ; Sun, 24 Jan 2021 02:14:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726481AbhAXCOB (ORCPT ); Sat, 23 Jan 2021 
21:14:01 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59888 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726433AbhAXCNe (ORCPT ); Sat, 23 Jan 2021 21:13:34 -0500 Received: from mail-wr1-x42d.google.com (mail-wr1-x42d.google.com [IPv6:2a00:1450:4864:20::42d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3B396C061788 for ; Sat, 23 Jan 2021 18:12:54 -0800 (PST) Received: by mail-wr1-x42d.google.com with SMTP id r9so2143387wro.9 for ; Sat, 23 Jan 2021 18:12:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=uTPn9q3Vmv00nzaGRt1pGDNiQdNZu7+0/RJKBPNfn1U=; b=cj/aCPwXr/gtHW6bCTLb4iKTjldppN5ET/9kuv0keipnNikFZ3lHU/HNC+hxD7sx66 vDWzX20wqWwrpDaNJI+1+6xSfNBIiZBlaTnK9qSJGeYBygpW5X2ohRunbtOwjWR2Jm4w tUVZpWX8R1ApcLJn3irULC9MCIFP9tOgmZW0B+g88v61pmRUaz9xTPOTpxk2YztblOfe ZoXPUtNh8Bgxcml08Layf0ZhhjUFE2OgqozuV3ivfmOQ5q7hfm49t7INz8TqBtAdYXXF QIKm4vWG6VkH3FALxxY1ZxliXeTm/KpRG4Dk9QUyXLXgAhP7F5oOH+K2cS+3LogfStT1 qJew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=uTPn9q3Vmv00nzaGRt1pGDNiQdNZu7+0/RJKBPNfn1U=; b=UGAgCNDyjmyqHXA8PfD01aPwQSoxPGxYXZ6KMSblU8yDIWnXSAYsNfcnwaR0kejlE8 l55zEeeUocacZBP23r6PctFI814RhfDZiryfdidAnyzAsBolHJ2tKcdXdNipl3FYJRFd DtWO2Uv1MA3kC5ivoD6tkUinUivQzZ/BUcLNrogMeSc9AyQcfBBy/Ga4HoqbxKPZiyGX zn5M+PZA8zc7f5yC7/MBiBFwzMc08IGMeNLP4quL3aOqCJtwubZUgGZY5ZjUmwNjf8Ey AQbyvFDsCYRMwHaULcmpe0moIMEIdnppsyLg5LXJFdw1+6O5ZYxRt/TgV8a4XU7lfkjJ EUMw== X-Gm-Message-State: AOAM531KeCmRs4DSBf3vtQz6XeXYYFZDC/41OF+aXaiPM8S86RAv7c7N KDsxuRBeHMQUGg0N5XqeLQG/cAVHxCcwSg== X-Google-Smtp-Source: ABdhPJzGa+Mbuwlx3g3mnmosmg/3bTdJa+SlPwdp0pL5H4WQv6m80lVjtG4ufjRsZd8g8UVlaSCaew== X-Received: by 2002:a05:6000:185:: with SMTP 
id p5mr10833867wrx.403.1611454372818; Sat, 23 Jan 2021 18:12:52 -0800 (PST) Received: from vm.nix.is (vm.nix.is. [2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id z184sm17380129wmg.7.2021.01.23.18.12.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 23 Jan 2021 18:12:52 -0800 (PST) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Johannes Schindelin , Todd Zullinger , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFz?= =?utf-8?b?b24=?= Subject: [PATCH v3 2/4] grep/pcre2: simplify boolean spaghetti Date: Sun, 24 Jan 2021 03:12:27 +0100 Message-Id: <20210124021229.25987-3-avarab@gmail.com> X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8 In-Reply-To: <20190726150818.6373-9-avarab@gmail.com> References: <20190726150818.6373-9-avarab@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Simplify an expression I added in 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching, 2019-07-26) by using a simple application of De Morgan's laws[1]. I.e.: NOT(A && B) is Equivalent to (NOT(A) OR NOT(B)) 1. 
https://en.wikipedia.org/wiki/De_Morgan%27s_laws Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/grep.c b/grep.c index efeb6dc58d..0bb772f727 100644 --- a/grep.c +++ b/grep.c @@ -491,7 +491,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt options |= PCRE2_CASELESS; } if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && - !(!opt->ignore_case && (p->fixed || p->is_fixed))) + (opt->ignore_case || !(p->fixed || p->is_fixed))) options |= PCRE2_UTF; p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, From patchwork Sun Jan 24 02:12:28 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12041987 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3A8D1C433E6 for ; Sun, 24 Jan 2021 02:14:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1092D22B51 for ; Sun, 24 Jan 2021 02:14:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726475AbhAXCN7 (ORCPT ); Sat, 23 Jan 2021 21:13:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59892 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726440AbhAXCNg (ORCPT ); Sat, 23 Jan 2021 21:13:36 -0500 Received: from 
mail-wm1-x329.google.com (mail-wm1-x329.google.com [IPv6:2a00:1450:4864:20::329]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5DB2FC06178B for ; Sat, 23 Jan 2021 18:12:55 -0800 (PST) Received: by mail-wm1-x329.google.com with SMTP id u14so915946wml.4 for ; Sat, 23 Jan 2021 18:12:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=AM9QuCZGS+PF11PLhLic4kof6DkGWJzpkhV96cwlf/0=; b=CgVXbVgwKvDyjinYzjsBkCAcfvBUMNkpus6Mt9p3FDrEcQ1DY8eM5EuZobyLYd91Qj ISK1+vFZutzADxxa+PMUV3ATqyotqzDQR2D2CI9zVMPkSiNSkZH/LmOb3cszh+3Spq0T cK1ilOuDeLrHKu052aPQbrftoXNSU4U3foPcMmAvwh5ShvgcuVxUqdTZJMaueqH/yBYN Lv/DTgw0WApeCnXnDJi1cQRej/qIdOcj9s/PssK9Af8MPPzeT7GWtmqAi1oOcyXPNkRp ne5ZP9jDF+XbkKn0jRlcjcKUG5Y3JPLR7wGtKZg5gIQPBqW0Pm3PsQq9HBJYerXuMq7O UHbw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=AM9QuCZGS+PF11PLhLic4kof6DkGWJzpkhV96cwlf/0=; b=VkztUrIcvtVnFwT5tYFz8szW66JdBv1SLlYOi9VtA0z0Q19/wU0EKN+Jkpf964Edfp 1/MnZUD3gSq3gAervqlYSQsm/hD15k9jWM32xRw8c1hT9tOfqprQI9QhBOCPg/XFFXvb Ace9TyPW5hjr9mxRcK/6Du7pJefwGof95qeQ8xxQjHd7HSPAwQ4+55xUGDcwHZl/t6PW TJZJCvY+lZzyyaZXCu2UX+dJdl0RR7lMN/0XZ/6ObFD/rrxMxjn5z6IcDrbWxgx0FvGu NAvg9vcZ4302wAHOXcTA+bydoRR/qle6Qe+PepF39v12NuAOmCPGUFh/9NXeEDp6fWiC GsZg== X-Gm-Message-State: AOAM533vQcPGkbwLnLYDJvKEv6fPq4YJwOWF3z1hLoVBFqVS7kqc9bAP uJDrUVPKE32hLSN0MvT5R8K6LBTqsOZsIw== X-Google-Smtp-Source: ABdhPJwy3cq3x2S8BPIQQbmCjDGj1OdCe808vlm4VNhrIvWbZOkNWd8HPpeOaG4nzUo9eFTXclMUcQ== X-Received: by 2002:a1c:2003:: with SMTP id g3mr9850080wmg.90.1611454373921; Sat, 23 Jan 2021 18:12:53 -0800 (PST) Received: from vm.nix.is (vm.nix.is. 
[2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id z184sm17380129wmg.7.2021.01.23.18.12.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 23 Jan 2021 18:12:53 -0800 (PST) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Johannes Schindelin , Todd Zullinger , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFz?= =?utf-8?b?b24=?= Subject: [PATCH v3 3/4] grep/pcre2: further simplify boolean spaghetti Date: Sun, 24 Jan 2021 03:12:28 +0100 Message-Id: <20210124021229.25987-4-avarab@gmail.com> X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8 In-Reply-To: <20190726150818.6373-9-avarab@gmail.com> References: <20190726150818.6373-9-avarab@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Follow-up the last commit by splitting the fixed check for the PCRE2_UTF flag into a variable. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/grep.c b/grep.c index 0bb772f727..242b4a3506 100644 --- a/grep.c +++ b/grep.c @@ -473,6 +473,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt int jitret; int patinforet; size_t jitsizearg; + const int fixed = p->fixed || p->is_fixed; assert(opt->pcre2); @@ -491,7 +492,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt options |= PCRE2_CASELESS; } if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && - (opt->ignore_case || !(p->fixed || p->is_fixed))) + (opt->ignore_case || !fixed)) options |= PCRE2_UTF; p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, From patchwork Sun Jan 24 02:12:29 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 12041985 Return-Path: X-Spam-Checker-Version: 
SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 76B03C433E0 for ; Sun, 24 Jan 2021 02:13:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3F46422B51 for ; Sun, 24 Jan 2021 02:13:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726472AbhAXCNu (ORCPT ); Sat, 23 Jan 2021 21:13:50 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59900 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726448AbhAXCNh (ORCPT ); Sat, 23 Jan 2021 21:13:37 -0500 Received: from mail-wr1-x433.google.com (mail-wr1-x433.google.com [IPv6:2a00:1450:4864:20::433]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6DF4FC061793 for ; Sat, 23 Jan 2021 18:12:56 -0800 (PST) Received: by mail-wr1-x433.google.com with SMTP id d16so8110925wro.11 for ; Sat, 23 Jan 2021 18:12:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=/pboRxsKgfNSAirMxeF2E7FeLZxextRKJNWowahPAA8=; b=rMvxUokkkQAfOVQb/u8CM58SKVHArM3kkOMIjRoaitGcczmvb0Bz137Z6ArmZeWcvV rVxIy/dz8vTfLpaaFojp/z3YHbXLc86GRi290yySRvG5pxkE3MvSLg7oIRBVfuYCHX8y c4B0ae8B6Rkc0dxeFR3irpMTffZLeR8hvlNPj2NK+NaMlmOdc0jSKB4IP2+F18m6CExj GlTXvzzkHaqQ7YFSrPaqa9w7/xhMwdrUkoGAQjeOksXnmDnCfQKmWz6SEtd8S+hhb8N3 ki+PZPxB1vTidwrFwXfx1c/27u879zSLew7Zg3VP1gzNdaEW3n+xFLzwdcNARyFjT0Qa uynQ== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=/pboRxsKgfNSAirMxeF2E7FeLZxextRKJNWowahPAA8=; b=YTkHnJnozM9x9olroNhhr6oQcMs8qfr64jIwDhkU1+2XgL1secQsf5MRdKD71OhJZ0 h15w7OwT0sk2nDhVVuvRK5f2No2d9vCa458TB7PMKO5I0Pd/sUSKTgBcj01ykoe4VLlP 1h8WVQ5EVxEJoyXUFCfoex8o5nUTaGUF9+cxE+GzLRJduq6fzTDCEfNV0g8rMOfOqLoZ kQIoiN7WVndK7p0gHcIg5u1enp4+UoT5Zvcqo9TMySEke5JpjuIX1X6HjccU0iC42SGu zDCLbjO471DWr0FByEqR1Ec2wkrAKc5zp3vqx2EYft0IxS8O8GMQAHsf6Qx9t9SVEbm8 Y+hw== X-Gm-Message-State: AOAM532Sdzv0F6uxofzcZRhKNCq/CmsbzOlx/hi67CXRe/6sv9kNOPdY 5GC9mAYxtFI9ZFnOxkenQZPDj6ziLZshTA== X-Google-Smtp-Source: ABdhPJwZQKfYhOhvObwuos5W7U9rBP+9aAajOmn0+UydntmDpXOPm2b5DYKAzJAc53WdPCisn7Fy8w== X-Received: by 2002:a5d:47ce:: with SMTP id o14mr11311198wrc.18.1611454374899; Sat, 23 Jan 2021 18:12:54 -0800 (PST) Received: from vm.nix.is (vm.nix.is. [2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id z184sm17380129wmg.7.2021.01.23.18.12.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 23 Jan 2021 18:12:54 -0800 (PST) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Johannes Schindelin , Todd Zullinger , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFz?= =?utf-8?b?b24=?= Subject: [PATCH v3 4/4] grep/pcre2: better support invalid UTF-8 haystacks Date: Sun, 24 Jan 2021 03:12:29 +0100 Message-Id: <20210124021229.25987-5-avarab@gmail.com> X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8 In-Reply-To: <20190726150818.6373-9-avarab@gmail.com> References: <20190726150818.6373-9-avarab@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Improve the support for invalid UTF-8 haystacks given a non-ASCII needle when using the PCREv2 backend. 
This is a more complete fix for a bug I started to fix in 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching, 2019-07-26). Now that PCREv2 has the PCRE2_MATCH_INVALID_UTF mode we can make use of it.

This fixes the sort of case described in 8a5999838e (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26), i.e.:

 - The pattern (the needle) is non-ASCII (e.g. "ævar")
 - We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
 - We are using --ignore-case, or we're a non-fixed pattern

If those conditions were satisfied and we encountered invalid UTF-8 data in the haystack, PCREv2 might bark on it. In practice this only happened under the JIT backend (turned on by default on most platforms).

Ultimately this fixes a "regression" in b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01). I'm putting that in scare quotes because before then we wouldn't properly support these complex case-folding, locale etc. cases either; it just broke in different ways.

There was a bug related to this in the PCRE2_MATCH_INVALID_UTF mode, fixed in PCREv2 10.36. It can be worked around by setting the PCRE2_NO_START_OPTIMIZE flag. Let's do that in those cases, and add tests for the bug.
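As a rough illustration of the option selection described above, here is a standalone sketch. It is not the code in grep.c: the SKETCH_* bits, struct sketch_input and both functions are hypothetical stand-ins so the logic can be compiled without <pcre2.h>, and the version test is written as a plain "10.36 or later" comparison on major and minor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PCRE2 option bits; the real values live in <pcre2.h>. */
enum {
	SKETCH_CASELESS          = 1u << 0,
	SKETCH_UTF               = 1u << 1,
	SKETCH_MATCH_INVALID_UTF = 1u << 2,
	SKETCH_NO_START_OPTIMIZE = 1u << 3
};

/* Hypothetical bundle of the inputs the commit message talks about. */
struct sketch_input {
	bool ignore_locale;
	bool utf8_locale;            /* e.g. "en_US.UTF-8", not "C" */
	bool pattern_has_non_ascii;  /* e.g. the pattern "ævar" */
	bool ignore_case;            /* --ignore-case */
	bool fixed;                  /* fixed-string pattern */
	bool have_match_invalid_utf; /* library knows PCRE2_MATCH_INVALID_UTF */
	int pcre2_major;
	int pcre2_minor;
};

/* True when the library is new enough to contain the 10.36 fix. */
static bool sketch_has_bug_2642_fix(int major, int minor)
{
	return major > 10 || (major == 10 && minor >= 36);
}

static uint32_t sketch_compile_options(const struct sketch_input *in)
{
	uint32_t options = 0;

	if (in->ignore_case)
		options |= SKETCH_CASELESS;

	/* UTF mode only for non-ASCII patterns under a UTF-8 locale,
	 * and only when matching is case-insensitive or non-fixed. */
	if (!in->ignore_locale && in->utf8_locale &&
	    in->pattern_has_non_ascii &&
	    (in->ignore_case || !in->fixed)) {
		options |= SKETCH_UTF;
		if (in->have_match_invalid_utf)
			options |= SKETCH_MATCH_INVALID_UTF;
	}

	/* Workaround for the bug fixed in 10.36: disable start optimizations. */
	if (in->have_match_invalid_utf &&
	    (options & (SKETCH_UTF | SKETCH_CASELESS)) &&
	    !sketch_has_bug_2642_fix(in->pcre2_major, in->pcre2_minor))
		options |= SKETCH_NO_START_OPTIMIZE;

	return options;
}
```

With --ignore-case and a non-ASCII pattern under a UTF-8 locale, a 10.35 library would get all four bits in this sketch, while 10.36 would drop the start-optimization workaround. A case-sensitive fixed-string search gets none of them.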
Signed-off-by: Ævar Arnfjörð Bjarmason --- Makefile | 1 + grep.c | 8 +++++- grep.h | 4 +++ t/helper/test-pcre2-config.c | 12 +++++++++ t/helper/test-tool.c | 1 + t/helper/test-tool.h | 1 + t/t7812-grep-icase-non-ascii.sh | 46 ++++++++++++++++++++++++++++++++- 7 files changed, 71 insertions(+), 2 deletions(-) create mode 100644 t/helper/test-pcre2-config.c diff --git a/Makefile b/Makefile index 4edfda3e00..42a7ed96e2 100644 --- a/Makefile +++ b/Makefile @@ -722,6 +722,7 @@ TEST_BUILTINS_OBJS += test-online-cpus.o TEST_BUILTINS_OBJS += test-parse-options.o TEST_BUILTINS_OBJS += test-parse-pathspec-file.o TEST_BUILTINS_OBJS += test-path-utils.o +TEST_BUILTINS_OBJS += test-pcre2-config.o TEST_BUILTINS_OBJS += test-pkt-line.o TEST_BUILTINS_OBJS += test-prio-queue.o TEST_BUILTINS_OBJS += test-proc-receive.o diff --git a/grep.c b/grep.c index 242b4a3506..305c579aff 100644 --- a/grep.c +++ b/grep.c @@ -493,7 +493,13 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && (opt->ignore_case || !fixed)) - options |= PCRE2_UTF; + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); + + if (PCRE2_MATCH_INVALID_UTF && + options & (PCRE2_UTF | PCRE2_CASELESS) && + !(PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 36)) + /* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */ + options |= PCRE2_NO_START_OPTIMIZE; p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, p->patternlen, options, &error, &erroffset, diff --git a/grep.h b/grep.h index b5c4e223a8..ade21c8812 100644 --- a/grep.h +++ b/grep.h @@ -18,6 +18,10 @@ typedef int pcre2_code; typedef int pcre2_match_data; typedef int pcre2_compile_context; #endif +#ifndef PCRE2_MATCH_INVALID_UTF +/* PCRE2_MATCH_* dummy also with !USE_LIBPCRE2, for test-pcre2-config.c */ +#define PCRE2_MATCH_INVALID_UTF 0 +#endif #include "thread-utils.h" #include "userdiff.h" diff --git a/t/helper/test-pcre2-config.c 
b/t/helper/test-pcre2-config.c new file mode 100644 index 0000000000..5258fdddba --- /dev/null +++ b/t/helper/test-pcre2-config.c @@ -0,0 +1,12 @@ +#include "test-tool.h" +#include "cache.h" +#include "grep.h" + +int cmd__pcre2_config(int argc, const char **argv) +{ + if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) { + int value = PCRE2_MATCH_INVALID_UTF; + return !value; + } + return 1; +} diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c index 9d6d14d929..f97cd9f48a 100644 --- a/t/helper/test-tool.c +++ b/t/helper/test-tool.c @@ -46,6 +46,7 @@ static struct test_cmd cmds[] = { { "parse-options", cmd__parse_options }, { "parse-pathspec-file", cmd__parse_pathspec_file }, { "path-utils", cmd__path_utils }, + { "pcre2-config", cmd__pcre2_config }, { "pkt-line", cmd__pkt_line }, { "prio-queue", cmd__prio_queue }, { "proc-receive", cmd__proc_receive}, diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h index a6470ff62c..28072c0ad5 100644 --- a/t/helper/test-tool.h +++ b/t/helper/test-tool.h @@ -35,6 +35,7 @@ int cmd__online_cpus(int argc, const char **argv); int cmd__parse_options(int argc, const char **argv); int cmd__parse_pathspec_file(int argc, const char** argv); int cmd__path_utils(int argc, const char **argv); +int cmd__pcre2_config(int argc, const char **argv); int cmd__pkt_line(int argc, const char **argv); int cmd__prio_queue(int argc, const char **argv); int cmd__proc_receive(int argc, const char **argv); diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index 38457c2e4f..e5d1e4ea68 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -57,7 +57,12 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' printf "\\200\\n" >invalid-0x80 && echo "ævar" >expected && cat expected >>invalid-0x80 && - git add invalid-0x80 + git add invalid-0x80 && + + # Test for PCRE2_MATCH_INVALID_UTF bug + # https://bugs.exim.org/show_bug.cgi?id=2642 + printf 
"\\345Aæ\\n" >invalid-0xe5 && + git add invalid-0xe5 ' test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data' ' @@ -67,6 +72,13 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UT test_cmp expected actual ' +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data (PCRE2 bug #2642)' ' + git grep -h "Aæ" invalid-0xe5 >actual && + test_cmp invalid-0xe5 actual && + git grep -h "(*NO_JIT)Aæ" invalid-0xe5 >actual && + test_cmp invalid-0xe5 actual +' + test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' ' git grep -h "æ" invalid-0x80 >actual && test_cmp expected actual && @@ -74,9 +86,41 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invali test_cmp expected actual ' +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data (PCRE2 bug #2642)' ' + git grep -h "Aæ" invalid-0xe5 >actual && + test_cmp invalid-0xe5 actual && + git grep -h "(*NO_JIT)Aæ" invalid-0xe5 >actual && + test_cmp invalid-0xe5 actual +' + +test_lazy_prereq PCRE2_MATCH_INVALID_UTF ' + test-tool pcre2-config has-PCRE2_MATCH_INVALID_UTF +' + test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' ' test_might_fail git grep -hi "Æ" invalid-0x80 >actual && test_might_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual ' +test_expect_success GETTEXT_LOCALE,LIBPCRE2,PCRE2_MATCH_INVALID_UTF 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' ' + git grep -hi "Æ" invalid-0x80 >actual && + test_cmp expected actual && + git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual && + test_cmp expected actual +' + +test_expect_success GETTEXT_LOCALE,LIBPCRE2,PCRE2_MATCH_INVALID_UTF 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i (PCRE2 bug #2642)' ' + git grep -hi "Æ" invalid-0xe5 >actual && + test_cmp invalid-0xe5 actual && + git grep -hi "(*NO_JIT)Æ" invalid-0xe5 
>actual && + test_cmp invalid-0xe5 actual && + + # Only the case of grepping the ASCII part in a way that + # relies on -i fails + git grep -hi "aÆ" invalid-0xe5 >actual && + test_cmp invalid-0xe5 actual && + git grep -hi "(*NO_JIT)aÆ" invalid-0xe5 >actual && + test_cmp invalid-0xe5 actual +' + test_done