From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965681AbbLPK3m (ORCPT ); Wed, 16 Dec 2015 05:29:42 -0500 Received: from mail.skyhub.de ([78.46.96.112]:38757 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933879AbbLPK3k (ORCPT ); Wed, 16 Dec 2015 05:29:40 -0500 Date: Wed, 16 Dec 2015 11:29:30 +0100 From: Borislav Petkov To: fu.wei@linaro.org Cc: linaro-acpi@lists.linaro.org, linux-kernel@vger.kernel.org, zjzhang@codeaurora.org, tony.luck@intel.com, gong.chen@intel.com, ying.huang@intel.com, tomasz.nowicki@linaro.org, tekkamanninja@gmail.com, graeme.gregory@linaro.org, al.stone@linaro.org, hanjun.guo@linaro.org, jcm@redhat.com, mark.rutland@arm.com, catalin.marinas@arm.com, will.deacon@arm.com, rjw@rjwysocki.net, matt.fleming@intel.com, "Chen, Gong" Subject: Re: [PATCH v2 1/2] acpi, apei: add Boot Error Record Table (BERT) support Message-ID: <20151216102930.GA29775@pd.tnic> References: <=fu.wei@linaro.org> <1439916257-5483-1-git-send-email-fu.wei@linaro.org> <1439916257-5483-2-git-send-email-fu.wei@linaro.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <1439916257-5483-2-git-send-email-fu.wei@linaro.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 19, 2015 at 12:44:16AM +0800, fu.wei@linaro.org wrote: > From: Huang Ying > > Under normal circumstances, when a hardware error occurs, kernel will > be notified via NMI, MCE or some other method, then kernel will > process the error condition, report it, and recover it if possible. > But sometime, the situation is so bad, so that firmware may choose to > reset directly without notifying Linux kernel. > > Linux kernel can use the Boot Error Record Table (BERT) to get the > un-notified hardware errors that occurred in a previous boot. In this > patch, the error information is reported via printk. > > For more information about BERT, please refer to ACPI Specification > version 6.0, section 18.3.1: > http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf > > [Tony: Applied some cleanups suggested by Fu Wei] > [Fu Wei: delete EXPORT_SYMBOL_GPL(bert_disable), and do some cleanups] > > Signed-off-by: Huang Ying > Signed-off-by: Tomasz Nowicki > Signed-off-by: Chen, Gong > Tested-by: Jonathan (Zhixiong) Zhang > Tested-by: Fu Wei > Signed-off-by: Tony Luck > Signed-off-by: Fu Wei > --- > Documentation/kernel-parameters.txt | 3 + > drivers/acpi/apei/Makefile | 2 +- > drivers/acpi/apei/bert.c | 162 ++++++++++++++++++++++++++++++++++++ > include/acpi/apei.h | 1 + > 4 files changed, 167 insertions(+), 1 deletion(-) > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt > index 1d6f045..7c6402c 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -554,6 +554,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted. > > bootmem_debug [KNL] Enable bootmem allocator debug messages. > > + bert_disable [ACPI] > + Disable Boot Error Record Table (BERT) support. Please document what that new parameter is good for. > + > bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards) > bttv.radio= Most important insmod options are available as > kernel args too. > diff --git a/drivers/acpi/apei/Makefile b/drivers/acpi/apei/Makefile > index 5d575a9..e50573d 100644 > --- a/drivers/acpi/apei/Makefile > +++ b/drivers/acpi/apei/Makefile > @@ -3,4 +3,4 @@ obj-$(CONFIG_ACPI_APEI_GHES) += ghes.o > obj-$(CONFIG_ACPI_APEI_EINJ) += einj.o > obj-$(CONFIG_ACPI_APEI_ERST_DEBUG) += erst-dbg.o > > -apei-y := apei-base.o hest.o erst.o > +apei-y := apei-base.o hest.o erst.o bert.o > diff --git a/drivers/acpi/apei/bert.c b/drivers/acpi/apei/bert.c > new file mode 100644 > index 0000000..1426227 > --- /dev/null > +++ b/drivers/acpi/apei/bert.c > @@ -0,0 +1,162 @@ > +/* > + * APEI Boot Error Record Table (BERT) support > + * > + * Copyright 2011 Intel Corp. > + * Author: Huang Ying > + * > + * Under normal circumstances, when a hardware error occurs, kernel > + * will be notified via NMI, MCE or some other method, then kernel > + * will process the error condition, report it, and recover it if > + * possible. But sometime, the situation is so bad, so that firmware > + * may choose to reset directly without notifying Linux kernel. > + * > + * Linux kernel can use the Boot Error Record Table (BERT) to get the > + * un-notified hardware errors that occurred in a previous boot. > + * > + * For more information about BERT, please refer to ACPI Specification > + * version 4.0, section 17.3.1 > + * > + * This file is licensed under GPLv2. > + * > + */ > + > +#include > +#include > +#include > +#include > +#include > + > +#include "apei-internal.h" > + > +#define BERT_PFX "BERT: " This is done with pr_fmt > +static int bert_disable; > + > +static void __init bert_print_all(struct acpi_hest_generic_status *region, > + unsigned int region_len) > +{ > + struct acpi_hest_generic_status *estatus = region; > + int remain = region_len; > + u32 estatus_len; > + int first = 1; if (!estatus->block_status) return; > + while (remain > sizeof(struct acpi_hest_generic_status)) { > + /* No more error record */ ... records */ > + if (!estatus->block_status) > + break; This test should happen when you enter the function (see above) and when you assign to estatus, i.e. below, at the end of the while loop body: estatus = (void *)estatus + estatus_len; if (!estatus->block_status) break; > + > + estatus_len = cper_estatus_len(estatus); > + if (estatus_len < sizeof(struct acpi_hest_generic_status) || > + remain < estatus_len) { > + pr_err(FW_BUG BERT_PFX > + "Invalid error status block with length %u\n", "Invalid status block length (%u)" > + estatus_len); > + return; > + } > + > + if (cper_estatus_check(estatus)) { > + pr_err(FW_BUG BERT_PFX "Invalid error status block\n"); "Invalid error record." > + goto next; > + } > + > + if (first) { > + pr_info(HW_ERR "Error record from previous boot:\n"); Error records<--- plural. > + first = 0; > + } We have pr_info_once() for this. > + > + cper_estatus_print(KERN_INFO HW_ERR, estatus); > +next: > + estatus = (void *)estatus + estatus_len; > + remain -= estatus_len; > + } > +} > + > +static int __init setup_bert_disable(char *str) > +{ > + bert_disable = 1; > + > + return 0; > +} > +__setup("bert_disable", setup_bert_disable); > + > +static int __init bert_check_table(struct acpi_table_bert *bert_tab) > +{ > + if (bert_tab->header.length < sizeof(struct acpi_table_bert)) > + return -EINVAL; \n here > + if (bert_tab->region_length && This test looks redundant if you're going to compare it to the size of the BERT region struct below. > + bert_tab->region_length < sizeof(struct acpi_bert_region)) > + return -EINVAL; > + > + return 0; > +} > + > +static int __init bert_init(void) > +{ > + struct acpi_hest_generic_status *bert_region; > + struct acpi_table_bert *bert_tab; > + unsigned int region_len; > + acpi_status status; > + struct resource *r; > + int rc = -EINVAL; > + > + if (acpi_disabled) > + goto out; > + > + if (bert_disable) { > + pr_info(BERT_PFX "Boot Error Record Table (BERT) support is disabled.\n"); > + goto out; > + } > + > + status = acpi_get_table(ACPI_SIG_BERT, 0, > + (struct acpi_table_header **)&bert_tab); No need to break that line, just leave it stick out. > + if (status == AE_NOT_FOUND) { > + pr_err(BERT_PFX "Table is not found!\n"); This will get issued on *all* machines which don't have BERT - which is the majority now, I'm afraid - and that information is useless to people. It's not like they can do anything about it short of overriding their own firmware. So please remove that printk. > + goto out; > + } else if (ACPI_FAILURE(status)) { > + const char *msg = acpi_format_exception(status); > + > + pr_err(BERT_PFX "Failed to get table, %s\n", msg); > + goto out; > + } > + > + rc = bert_check_table(bert_tab); > + if (rc) { > + pr_err(FW_BUG BERT_PFX "BERT table is invalid\n"); Prefix already has "BERT". Either write it out: "BERT: Boot Error Record Table invalid" or make it shorter: "BERT: table invalid." > + goto out; > + } > + > + region_len = bert_tab->region_length; > + if (!region_len) { > + rc = 0; > + goto out; > + } Huh, we just checked ->region_length in bert_check_table(). This test is redundant. > + > + r = request_mem_region(bert_tab->address, region_len, "APEI BERT"); > + if (!r) { > + pr_err(BERT_PFX "Can't request iomem region <%016llx-%016llx>\n", > + (unsigned long long)bert_tab->address, > + (unsigned long long)bert_tab->address + region_len - 1); > + rc = -EIO; > + goto out; > + } > + > + bert_region = ioremap_cache(bert_tab->address, region_len); > + if (!bert_region) { > + rc = -ENOMEM; > + goto out_release; > + } > + > + bert_print_all(bert_region, region_len); > + > + iounmap(bert_region); > + > +out_release: > + release_mem_region(bert_tab->address, region_len); > +out: > + if (rc) > + bert_disable = 1; This assignment is redundant. We're not testing it anywhere else after bert_init() has run. > + > + return rc; > +} > + > +late_initcall(bert_init); > diff --git a/include/acpi/apei.h b/include/acpi/apei.h > index 76284bb..284801a 100644 > --- a/include/acpi/apei.h > +++ b/include/acpi/apei.h > @@ -23,6 +23,7 @@ extern bool ghes_disable; > #else > #define ghes_disable 1 > #endif > +extern int bert_disable; > > #ifdef CONFIG_ACPI_APEI > void __init acpi_hest_init(void); > -- > 2.4.3 > > -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.