All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tomas Elf <tomas.elf@intel.com>
Cc: Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [RFC 00/11] TDR/watchdog timeout support for gen8
Date: Thu, 9 Jul 2015 19:47:35 +0100	[thread overview]
Message-ID: <20150709184735.GA24462@nuc-i3427.alporthouse.com> (raw)
In-Reply-To: <1433783009-17251-1-git-send-email-tomas.elf@intel.com>

On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> This patch series introduces the following features:
> 
> * Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
> * Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
> * Feature 3. Context Submission Status Consistency checking

The high level design is reasonable and conceptually extends the current
system in fairly obvious ways.

In terms of discussing the implementation, I think this can be phased
as:

0. Move to a per-engine hangcheck

1. Add fake-interrupt recovery for CSSC

  I think this can be done without changing the hangcheck heuristics at
  all - we just need to think carefully about recovery (which is a nice
  precursor to per-engine reset). I may be wrong, and so would like to
  be told so early on! If the fake interrupt recovery is done as part of
  the reset handler, we should have (one) fewer concurrency issues to
  worry about. (In a similar vein, I think we should move the missed
  interupt handler for legacy out of hangcheck, partly to simplify some
  very confusing code and partly so that we have fewer deviations
  between legacy/execlists paths.) It also gets us thinking about the
  consistency detection and when it is viable to do a fake-interrupt and
  when we must do a full-reset (for example, we only want to
  fake-interrupt if the consistency check says the GPU is idle, and we
  definitely want to reset everything if the GPU is executing an alien
  context.)

  A test here would be to suspend the execlists irq and wait for the
  recovery. Cheekily we could punch the irq eir by root mmio and check
  hangcheck automagically recovers.

Whilst it would be nice to add the watchdog next, since it is
conceptually quite simple and basically just a new hangcheck source with
fancy setup - fast hangcheck without soft reset makes for an easy DoS.

2. TDR

  Given that we have a consistency check and begun to extend the reset
  path, we can implement a soft reset that only skips the hung request.
  (The devil is in the details, whilst the design here looked solid, I
  think the LRC recovery code could be simplified - I didn't feel
  another requeue path was required given that we need only pretend a
  request completed (fixing up the context image as required) and then
  use the normal unqueue.) There is also quite a bit of LRC cleanup on
  the lists which would be useful here.

  Lots of tests for concurrent engine utilisation, multiple contexts,
  etc and ensuring that a hang in one does not affect independent work
  (engines, batches, contexts).

3. Watchdog

  A new fast hangcheck source. So concurrency, promotion (relative
  scoring between watchdog / hangcheck) and issue with not programming
  the ring correctly (beware the interrupt after performing a dispatch
  and programming the tail, needs either reserved space, inlining into
  the dispatch etc).

  The only thing of remark here is the uapi. It is a server feature
  (interactive clients are less likely to tolerate data being thrown
  away). Do we want an execbuf bit or context param? An execbuf bit
  allows switching between two watchdog timers (or on/off), but we
  have many more context params available than bits. Do we want to
  expose context params to set the fast/slow timeouts?

  We haven't found any GL spec that descibe controllable watchdogs, so
  the ultimate uapi requirements are unknown. My feeling is that we want
  to do the initial uapi through context params, and start with a single
  fast watchdog timeout value. This is sufficient for testing and
  probably for most use cases. This can easily be extended by adding an
  execbuf bit to switch between two values and a context param to set
  the second value. (If the second value isn't set, the execbuf bit
  dosn't do anything the watchdog is always programmed to the user
  value. If the second value is set (maybe infinite), then the execbuf
  bit is used to select which timeout to use for that batch.) But given
  that this is a server-esque feature, it is likely to be a setting the
  userspace driver imposes upon its clients, and there is unlikely to be
  the need to switch timeouts within any one client.

  The tests here would focus on the uapi and ensuring that if the client
  asks for a 60ms hang detection, then all work packets take at most
  60ms to complete. Hmm, add wait_ioctl to the list of uapi that should
  report -EIO.

>  drivers/gpu/drm/i915/i915_debugfs.c     |  146 +++++-
>  drivers/gpu/drm/i915/i915_drv.c         |  201 ++++++++
>  drivers/gpu/drm/i915/intel_lrc.c        |  858 ++++++++++++++++++++++++++++++-

The balance here feels wrong ;-)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

  parent reply	other threads:[~2015-07-09 18:47 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-08 17:03 [RFC 00/11] TDR/watchdog timeout support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 01/11] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
2015-06-08 17:36   ` Chris Wilson
2015-06-09 11:02     ` Tomas Elf
2015-06-16 13:44   ` Daniel Vetter
2015-06-16 15:46     ` Tomas Elf
2015-06-16 16:50       ` Chris Wilson
2015-06-16 17:07         ` Tomas Elf
2015-06-17 11:43       ` Daniel Vetter
2015-06-08 17:03 ` [RFC 02/11] drm/i915: Introduce uevent for full GPU reset Tomas Elf
2015-06-16 13:43   ` Daniel Vetter
2015-06-16 15:43     ` Tomas Elf
2015-06-16 16:55       ` Chris Wilson
2015-06-16 17:32         ` Tomas Elf
2015-06-16 19:33           ` Chris Wilson
2015-06-17 11:49             ` Daniel Vetter
2015-06-17 12:51               ` Chris Wilson
2015-06-08 17:03 ` [RFC 03/11] drm/i915: Add reset stats entry point for per-engine reset Tomas Elf
2015-06-08 17:33   ` Chris Wilson
2015-06-09 11:06     ` Tomas Elf
2015-06-16 13:48     ` Daniel Vetter
2015-06-16 13:54       ` Chris Wilson
2015-06-16 15:55         ` Daniel Vetter
2015-06-18 11:12         ` Dave Gordon
2015-06-11  9:14   ` Dave Gordon
2015-06-16 13:49   ` Daniel Vetter
2015-06-16 15:54     ` Tomas Elf
2015-06-17 11:51       ` Daniel Vetter
2015-06-08 17:03 ` [RFC 04/11] drm/i915: Adding TDR / per-engine reset support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 05/11] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Tomas Elf
2015-06-08 17:24   ` Chris Wilson
2015-06-09 11:08     ` Tomas Elf
2015-06-09 11:11   ` Chris Wilson
2015-06-08 17:03 ` [RFC 06/11] drm/i915: Disable warnings for TDR interruptions in the display driver Tomas Elf
2015-06-08 17:53   ` Chris Wilson
2015-06-08 17:03 ` [RFC 07/11] drm/i915: Reinstate hang recovery work queue Tomas Elf
2015-06-08 17:03 ` [RFC 08/11] drm/i915: Watchdog timeout support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 09/11] drm/i915: Fake lost context interrupts through forced CSB check Tomas Elf
2015-06-08 17:03 ` [RFC 10/11] drm/i915: Debugfs interface for per-engine hang recovery Tomas Elf
2015-06-08 17:45   ` Chris Wilson
2015-06-09 11:18     ` Tomas Elf
2015-06-09 12:27       ` Chris Wilson
2015-06-09 17:28         ` Tomas Elf
2015-06-11  9:32     ` Dave Gordon
2015-06-08 17:03 ` [RFC 11/11] drm/i915: TDR/watchdog trace points Tomas Elf
2015-06-23 10:05 ` [RFC 00/11] TDR/watchdog timeout support for gen8 Daniel Vetter
2015-06-23 10:47   ` Tomas Elf
2015-06-23 11:38     ` Daniel Vetter
2015-06-23 14:06       ` Tomas Elf
2015-06-23 15:20         ` Daniel Vetter
2015-06-23 15:35           ` Daniel Vetter
2015-06-25 10:38             ` Tomas Elf
2015-07-03 11:15 ` Mika Kuoppala
2015-07-03 17:41   ` Tomas Elf
2015-07-09 18:47 ` Chris Wilson [this message]
2015-07-10 15:24   ` Tomas Elf
2015-07-10 15:48     ` Tomas Elf
2015-07-11 18:15       ` Chris Wilson
2015-07-11 18:22     ` Chris Wilson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150709184735.GA24462@nuc-i3427.alporthouse.com \
    --to=chris@chris-wilson.co.uk \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=tomas.elf@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.