From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 23 Jun 2015 08:21:31 -0700
From: Guenter Roeck
To: Fu Wei
Cc: Suravee Suthikulpanit, Linaro ACPI Mailman List,
	linux-watchdog@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Wei Fu, G Gregory, Al Stone, Hanjun Guo, Timur Tabi,
	Ashwin Chaugule, Arnd Bergmann, Vipul Gandhi, Wim Van Sebroeck,
	Jon Masters, Leo Duran, Jon Corbet, Mark Rutland, Catalin Marinas,
	Will Deacon, rjw@rjwysocki.net
Subject: Re: [non-pretimeout,4/7] Watchdog: introduce ARM SBSA watchdog driver
Message-ID: <20150623152131.GA9990@roeck-us.net>
References: <1433958452-23721-5-git-send-email-fu.wei@linaro.org>
	<20150611162810.GA22711@roeck-us.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jun 23, 2015 at 09:26:35PM +0800, Fu Wei wrote:
> Hi Guenter,

[ ...]
> >
> >> + * When the first timeout occurs, WS0(SPI or LPI) is triggered,
> >> + * the second timeout period(as long as the first timeout period) starts.
> >
> > no longer accurate if WOR is used for the second period.
> >
> >> + * In WS0 interrupt routine, panic() will be called for collecting
> >> + * crashdown info.
> >> + * If system can not recover from WS0 interrupt routine, then second
> >> + * timeout occurs, WS1(reset or higher level interrupt) is triggered.
> >> + * The two timeout period can be set by WOR(32bit).
> >
> > The second timeout period is determined by ...
> >
> >> + * WOR gives a maximum watch period of around 10s at the maximum
> >> + * system counter frequency.
> >> + * The System Counter shall run at maximum of 400MHz.
> >
> > "... at the maximum system counter frequency of 400 MHz.", and drop the
> > last sentence.
>
> For the second timeout period, I have discussed this with kdump developers:
> (1) 10s may not be good enough for all cases of panic + kdump, so
> maybe we still need to use WCV in the second timeout period.
> (2) In the second timeout period, maybe we need to programme WCV for
> two reasons: a, trigger WS1 to reboot the system ASAP; b, feed the watchdog
> without clearing the WS0 flag.
>
> WHY do we want to feed the watchdog (keepalive) without clearing the WS0 flag?
> REASON:
> (1) If the system context is large, we may need to feed the dog until
> we get everything backed up.
> (2) If the system goes wrong, WS0 is triggered, then panic --> kdump. If we
> feed the dog by WRR or by programming WOR, the WS0 flag will be cleared. Once
> the system goes wrong again, it panics again.
> So the system will be in a panic--kdump--panic--kdump loop and has no
> chance to reset.
>
> So if we are in the second timeout period, we may need to always programme WCV.
>

The crashdump kernel is supposed to reload the watchdog driver, which
will ping the watchdog. If it isn't able to do that in 10 seconds,
something is wrong.
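
For reference, the "around 10s" figure quoted above follows directly from
the register width. The sketch below is user-space C for illustration
only, not code from the patch; it assumes WOR is a 32-bit count of system
counter ticks, and the names wor_max_period_ms() and wor_fits() are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: WOR holds a 32-bit number of system counter ticks. */
#define WOR_MAX_TICKS UINT32_MAX

/* Longest watch period, in milliseconds, that WOR can express at freq_hz. */
uint64_t wor_max_period_ms(uint64_t freq_hz)
{
	assert(freq_hz != 0);
	/* 0xffffffff * 1000 is about 4.3e12, so it fits easily in 64 bits. */
	return (uint64_t)WOR_MAX_TICKS * 1000 / freq_hz;
}

/* Nonzero if a requested timeout can be programmed into WOR at freq_hz. */
int wor_fits(uint64_t timeout_ms, uint64_t freq_hz)
{
	return timeout_ms <= wor_max_period_ms(freq_hz);
}
```

At 400 MHz this gives 10737 ms, so a 10 second period fits and a 30
second period does not. A slower counter stretches the limit: at 100 MHz
the same register covers roughly 42.9 seconds.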
> >> +
> >> +	status = readl_relaxed(gwdt->control_base + SBSA_GWDT_WCS);
> >> +	if (status & SBSA_GWDT_WCS_WS1) {
> >> +		dev_warn(dev, "System reset by WDT(WCV: %llx)\n",
> >> +			 sbsa_gwdt_get_wcv(wdd));
> >
> > WCV here only tells us how many clock cycles were executed since the
> > system started (or something like that). So I still don't understand
> > why it is valuable to print that number.
>
> This number provides the time of system reset; I think that may help
> the admin to analyse the system failure.
>

It doesn't mean anything to anyone but you since it is not in a well
defined time scale. Also, I would be somewhat surprised if WCV would
retain its value on reset. Much more likely it is the time (in clock
cycles) since reset.

Guenter
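
To illustrate the "well defined time scale" objection: a raw WCV value is
a tick count, and it only becomes a time once it is divided by the system
counter frequency. The sketch below is user-space C for illustration
only, not code from the patch; struct wcv_time and wcv_to_time() are
hypothetical names, and the counter frequency is assumed to be known to
the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a tick count split into seconds and nanoseconds. */
struct wcv_time {
	uint64_t sec;
	uint32_t nsec;
};

/*
 * Convert a tick count to seconds and nanoseconds.
 * Assumption: freq_hz is at most a few GHz, so the remainder multiplied
 * by 1e9 does not overflow 64 bits.
 */
struct wcv_time wcv_to_time(uint64_t ticks, uint64_t freq_hz)
{
	struct wcv_time t;

	assert(freq_hz != 0);
	t.sec = ticks / freq_hz;
	t.nsec = (uint32_t)((ticks % freq_hz) * 1000000000ULL / freq_hz);
	return t;
}
```

For example, 1000000000 ticks at 400 MHz convert to 2 s and 500000000 ns.
The conversion only addresses the scale; if WCV does not retain its value
across reset, the converted number is still just the time since reset.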