kernelci.lists.linux.dev archive mirror
From: "Ricardo Cañuelo" <ricardo.canuelo@collabora.com>
To: kernelci@lists.linux.dev
Subject: On error identification, classification and related tooling
Date: Thu, 13 Jun 2024 10:11:53 +0200	[thread overview]
Message-ID: <877cet8do6.fsf@collabora.com> (raw)

Hi all,

Over the past few weeks, several discussions have taken place in
meetings and online threads[1] about the problem of modeling errors
found in test runs so that they can then be profiled and classified (or
tagged).

Some of us feel this is a missing piece in the current state of the art
of many CI systems, with minor exceptions (e.g. Syzbot), and that
introducing a way to operate on errors as "data types" could greatly
extend the usefulness of the data these systems are collecting.

IMHO, having massive databases of test results is useful for detecting,
reporting and browsing issues, but much more could be done with that
data if we provided additional layers of processing to extract and model
higher-level data from it. This has been discussed as a long-term plan
for KernelCI, and CI in general, for some time.

Error modeling and profiling is one of the areas that we'd like to
explore first.

Some individual contributors who go through test results and regressions
already do this kind of work manually: inspecting the test logs,
identifying the error causes and classifying them. However, there's no
provision in Maestro or KCIDB yet for users to contribute this kind of
curated information.

Personally, I'm exploring the possibility of an automatic process to
analyze and profile the errors found in a test log in a standard
way. The goal I'm aiming for is a low-cost, system-agnostic way to
automatically digest a test log into schema-based structured data that
we can store in a DB and then use as first-class data to perform
comparisons and classifications. Some of the problems we could address
with this are:

- Automatically tell if an error happened in another test run
- Group test failures together depending on the errors they triggered
- Automatic classification of errors / test results / regressions
  depending on certain error parameters or contents
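The first of these points, for example, could build on a stable
fingerprint computed from the structured error fields: two runs that hit
the same error produce the same fingerprint, so "seen before?" becomes a
simple lookup. A minimal sketch, assuming hypothetical field names taken
from the example output further down; this is not anything logspec
currently implements:

```python
import hashlib
import json

def error_fingerprint(error: dict) -> str:
    """Derive a stable fingerprint from structured error fields.

    Only fields that identify the error are hashed; run-specific
    noise (offsets, timestamps) is deliberately left out. The field
    names are hypothetical, mirroring the example JSON output.
    """
    key_fields = {k: error[k] for k in ("error_type", "script", "src_file")
                  if k in error}
    # Canonical JSON encoding so field order doesn't affect the hash.
    canonical = json.dumps(key_fields, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

With this, grouping failures (the second point) reduces to bucketing
records by fingerprint.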

Some of these features are worthwhile end goals by themselves; others
are important stepping stones towards goals such as automatic triaging
of regressions or enhanced reports.

As a proof of concept, and to evaluate the viability of this as an
automatic process, I started hacking on a tool called logspec [2], which
is basically an extensible context-sensitive parser. It's at a very
early, experimental stage and at this point is little more than a
springboard for ideas in this area.

In its current form, it can parse a number of different types of kernel
build errors (as provided by Maestro):

    ./logspec.py tests/logs/kbuild/kbuild_001.log kbuild

    {
        "errors": [
            {
                "error_type": "Compiler error",
                "location": "1266:3",
                "script": "scripts/Makefile.build:244",
                "src_file": "drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c",
                "target": "drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.o"
            }
        ]
    }

    ./logspec.py tests/logs/kbuild/kbuild_002.log kbuild
    {
        "errors": [
            {
                "error_type": "Kbuild/Make",
                "script": "Makefile:1953",
                "target": "modules"
            }
        ]
    }

    (full output of the same parse):

    ./logspec.py tests/logs/kbuild/kbuild_002.log kbuild --json-full
    {
        "_match_end": 369194,
        "errors": [
            {
                "_report": "***\n*** The present kernel configuration has modules disabled.\n*** To use the module feature, please run \"make menuconfig\" etc.\n*** to enable CONFIG_MODULES.\n***\n",
                "error_type": "Kbuild/Make",
                "script": "Makefile:1953",
                "target": "modules"
            }
        ]
    }
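The kind of extraction behind these examples could be approximated, for
a narrow case, with a couple of regular expressions over gcc and Make
diagnostics. This is only an illustrative sketch of the idea, not
logspec's actual implementation (which is structured as an extensible
context-sensitive parser rather than flat regexes):

```python
import re

# Matches gcc-style diagnostics: "path/file.c:1266:3: error: ..."
COMPILER_ERR = re.compile(r"^(?P<src_file>\S+\.[ch]):(?P<location>\d+:\d+): error:")
# Matches Make failures: "make[4]: *** [scripts/Makefile.build:244: path/file.o] Error 1"
MAKE_ERR = re.compile(r"\*\*\* \[(?P<script>[^:]+:\d+): (?P<target>[^\]]+)\] Error")

def parse_kbuild_log(text: str) -> list[dict]:
    """Pair each compiler diagnostic with the Make error that follows it."""
    errors = []
    current = {}
    for line in text.splitlines():
        m = COMPILER_ERR.match(line)
        if m:
            current = {"error_type": "Compiler error", **m.groupdict()}
            continue
        m = MAKE_ERR.search(line)
        if m and current:
            current.update(m.groupdict())
            errors.append(current)
            current = {}
    return errors
```

A single pass like this already yields records shaped like the JSON
output above, but it falls apart as soon as the log format varies, which
is exactly why a context-sensitive approach seems necessary.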

And it can detect certain types of errors during Linux startup (partial output below):

    ./logspec.py tests/logs/linux_boot/linux_boot_005.log generic_linux_boot
    {
        "bootloader_ok": true,
        "errors": [
            {
                "call_trace": [
                    "? __warn+0x98/0xda",
                    "? apply_returns+0xc0/0x241",
                    "? report_bug+0x96/0xda",
                    "? handle_bug+0x3c/0x65",
                    "? exc_invalid_op+0x14/0x65",
                    "? asm_exc_invalid_op+0x12/0x20",
                    "? apply_returns+0xc0/0x241",
                    "alternative_instructions+0x7d/0x143",
                    "arch_cpu_finalize_init+0x23/0x42",
                    "start_kernel+0x4da/0x58c",
                    "secondary_startup_64_no_verify+0xac/0xbb"
                ],
                "error_type": "WARNING: missing return thunk: 0xffffffffb6845838-0xffffffffb684583d: e9 00 00 00 00",
                "hardware": "Google Coral/Coral, BIOS  09/29/2020",
                "location": "arch/x86/kernel/alternative.c:730 apply_returns+0xc0/0x241",
                "modules": []
            },
            {
                "call_trace": [
                    "? __die_body+0x1b/0x5e",
                    "? no_context+0x36d/0x422",
                    "? mutex_lock+0x1c/0x3b",
                    "? exc_page_fault+0x249/0x3f0",
                    "? asm_exc_page_fault+0x1e/0x30",
                    "? string_nocheck+0x19/0x3d",
                    "string+0x42/0x4b",
                    "vsnprintf+0x21c/0x427",
                    "devm_kvasprintf+0x4a/0x9e",
                    "devm_kasprintf+0x4e/0x69",
                    "? __radix_tree_lookup+0x3a/0xba",
                    "__devm_ioremap_resource+0x7c/0x12d",
                    "intel_pmc_get_resources+0x97/0x29c [intel_pmc_bxt]",
                    "? devres_add+0x2f/0x40",
                    "intel_pmc_probe+0x81/0x176 [intel_pmc_bxt]",
                    "platform_drv_probe+0x2f/0x74",
                    "really_probe+0x15c/0x34e",
                    "driver_probe_device+0x9c/0xd0",
                    "device_driver_attach+0x3c/0x59",
                    "__driver_attach+0xa2/0xaf",
                    "? device_driver_attach+0x59/0x59",
                    "bus_for_each_dev+0x73/0xad",
                    "bus_add_driver+0xd8/0x1d4",
                    "driver_register+0x9e/0xdb",
                    "? 0xffffffffc00b7000",
                    "do_one_initcall+0x90/0x1ae",
                    "? slab_pre_alloc_hook.constprop.0+0x31/0x47",
                    "? kmem_cache_alloc_trace+0xfb/0x111",
                    "do_init_module+0x4b/0x1fd",
                    "__do_sys_finit_module+0x94/0xbf",
                    "__do_fast_syscall_32+0x71/0x86",
                    "do_fast_syscall_32+0x2f/0x6f",
                    "entry_SYSENTER_compat_after_hwframe+0x65/0x77"
                ],
                "error_type": "BUG: unable to handle page fault for address: 0000000000200286",
                "hardware": "Google Coral/Coral, BIOS  09/29/2020",
                "modules": [
                    "acpi_thermal_rel",
                    "chromeos_pstore",
                    "coreboot_table",
                    "ecc",
                    "ecdh_generic",
                    "elan_i2c",
                    "i2c_hid",
                    "int340x_thermal_zone",
                    "intel_pmc_bxt(+)",
                    "pinctrl_broxton"
                ]
            },
            ...
        ],
        "prompt_ok": true
    }
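The call traces above also hint at how failures could be grouped:
stripping the build-specific offsets (`+0xc0/0x241`), the
unreliable-frame markers (`? `) and raw addresses reduces a trace to a
signature that should match across different kernel builds. A
hypothetical sketch, not part of logspec today:

```python
def normalize_trace(call_trace: list[str]) -> tuple[str, ...]:
    """Reduce a call trace to bare function names so the same crash
    matches across builds with different offsets and sizes."""
    frames = []
    for frame in call_trace:
        frame = frame.removeprefix("? ")   # drop unreliable-frame marker
        frame = frame.split("+", 1)[0]     # drop +0xOFFSET/0xSIZE [module]
        if not frame.startswith("0x"):     # drop raw addresses
            frames.append(frame)
    return tuple(frames)
```

The resulting tuple is hashable, so it can feed directly into the same
kind of fingerprint-based grouping as the structured build errors.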

I've yet to decide on a schema for this structured data; for now I'd
rather keep adding parsers to catch more conditions and results.

It's possible that this approach isn't viable or realistic, given the
lack of consistency even in very restricted types of errors (Linux
kernel error reports are particularly inconsistent) and the glitches and
other artifacts inherent to serial logs (interleaved lines,
etc.). Still, some of these problems might be mitigated by pre-filtering
the logs and running the parsers on narrowed-down segments instead of on
whole logs.
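As a sketch of what such pre-filtering could look like, the log could be
cut into windows starting at well-known kernel error markers, so parsers
only ever see a narrow segment. The marker list and window size here are
assumptions, not anything logspec currently implements:

```python
import re

# A few well-known kernel error markers at start of line.
MARKERS = re.compile(
    r"^(?:-+\[ cut here \]-+|WARNING:|BUG:|Oops:|Kernel panic)",
    re.MULTILINE)

def error_segments(log: str, window: int = 2000) -> list[str]:
    """Return a fixed-size window of text after each error marker,
    so parsers run on narrow segments instead of the whole log."""
    return [log[m.start():m.start() + window]
            for m in MARKERS.finditer(log)]
```

Interleaved lines would still corrupt individual segments, but at least
an artifact in one segment wouldn't derail parsing of the rest of the
log.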

So I want to keep playing with this for now to see whether it makes
sense to continue. Maybe someone else has a better approach to this
problem (ML-based, perhaps?), so any feedback on the general idea and on
the implementation is welcome.

Thank you all,
Ricardo

---

[1] https://github.com/kernelci/kcidb-io/pull/78
[2] https://gitlab.collabora.com/rcn/logspec

Thread overview: 2+ messages
2024-06-13  8:11 Ricardo Cañuelo [this message]
2024-06-16 10:25 ` On error identification, classification and related tooling Nikolai Kondrashov
