Re: [RFC 00/11] TDR/watchdog timeout support for gen8

From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tomas Elf <tomas.elf@intel.com>
Cc: Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [RFC 00/11] TDR/watchdog timeout support for gen8
Date: Sat, 11 Jul 2015 19:15:43 +0100	[thread overview]
Message-ID: <20150711181543.GD21656@nuc-i3427.alporthouse.com> (raw)
In-Reply-To: <559FE965.8070807@intel.com>

On Fri, Jul 10, 2015 at 04:48:53PM +0100, Tomas Elf wrote:
> On 10/07/2015 16:24, Tomas Elf wrote:
> >On 09/07/2015 19:47, Chris Wilson wrote:
> >>On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> >>>This patch series introduces the following features:
> >>>
> >>>* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist
> >>>mode.
> >>>* Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
> >>>* Feature 3. Context Submission Status Consistency checking
> >>
> >>The high level design is reasonable and conceptually extends the current
> >>system in fairly obvious ways.
> >>
> >>In terms of discussing the implementation, I think this can be phased
> >>as:
> >>
> >>0. Move to a per-engine hangcheck
> >>
> >>1. Add fake-interrupt recovery for CSSC
> >>
> >>   I think this can be done without changing the hangcheck heuristics at
> >>   all - we just need to think carefully about recovery (which is a nice
> >>   precursor to per-engine reset). I may be wrong, and so would like to
> >>   be told so early on! If the fake interrupt recovery is done as part of
> >>   the reset handler, we should have (one) fewer concurrency issues to
> >>   worry about.
> >
> >Some points about moving the CSSC out of the hang checker and into the
> >reset handler:
> >
> >1. If we deal with consistency rectification in the reset handler the
> >turnaround time becomes REALLY long:
> >
> >     a. First you have the time to detect the hang, call
> >i915_handle_error() that then raises the reset in progress flag,
> >preventing further submissions to the driver.
> >
> >     b. Then go all the way to the per-engine recovery path, only to
> >discover that we've got an inconsistency that has not been handled, fall
> >back immediately with -EAGAIN and lower the reset in progress flag, let
> >the system continue running and defer to the next hang check (another
> >hang detection period)
> >
> >     c. Once the hang has been detected AGAIN, raise the reset in
> >progress flag AGAIN and go back to the engine reset path a second time.
> >
> >     d. At the start of the engine reset path we do the second CSSC
> >detection and realise that we've got a stable inconsistency that we can
> >attempt to rectify. We can then try to rectify the inconsistency and go
> >through with the engine reset... AFTER we've checked that the
> >inconsistency rectification was indeed effective! If it's not and the
> >problem remains then we have to fail the engine recovery mode and fall
> >back to full GPU reset immediately... Which we could have done from the
> >hang checker if we had just refused to schedule hang recovery and just
> >let the context submission state inconsistency persist and let the hang
> >score keep rising until the hang score reached 2*HUNG, which would then
> >have triggered the full GPU reset fallback from the hang checker (see
> >path 9/11 for all of this)
> >
> >As you can see, dealing with context submission state inconsistency in
> >the reset path is very long-winded way of doing it and does not make it
> >more reliable. Also, it's more complicated to analyse from a concurrency
> >point of view since we need to fall back several times and raise and
> >lower the reset in progress flag, which allows driver submissions to
> >happen vs. blocks submissions. It basically becomes very difficult to
> >know what is going on.
> >
> >2. Secondly, and more importantly, if a watchdog timeout is detected and
> >we end up in the per-engine hang recovery path and have to fall back due
> >to an inconsistent context submission state at that point and the hang
> >checker is turned off then we're irrecoverably hung. Watchdog timeout is
> >supposed to work without the periodic hang checker but it won't if CSSC
> >is not ensured at all times. Which is why I chose to override the
> >i915.enable_hangcheck flag to make sure that the hang checker always
> >runs consistency pre-checking and reschedules itself if there is more
> >work pending to make sure that as long as work is pending we do
> >consistency checking asynchronously regardless of everything else so
> >that if a watchdog timeout hits we have a consistent state once the
> >watchdog timeout ends up in per-engine recovery.
> >
> >Granted, if a watchdog timeout hits after we've first detected the
> >inconsistency but not yet had time to rectify it it doesn't work if the
> >hang checker is turned off and we cannot rely on periodic hang checking
> >to schedule hang recovery in this case - so in that case we're still
> >irrecoverably stuck. We could make change here and do a one-time
> >i915.enable_hangcheck override and schedule hang recovery following this
> >point. If you think it's worth it.
> >
> >Bottom line: The consistency checking must happen at all times and
> >cannot be done as a consequence of a scheduled reset if hang checking is
> >turned off at any point.
> >
> >As far as concurrency issues in the face of CSSC is concerned,
> >disregarding the complication of handling CSSC in the recovery path and
> >relying on deferring to the next hang detection    with all of the
> >concurrency issues that entails: The question really is what kind of
> >concurrency issues we're worried about. If the hang checker determines
> >that we've got a hang then that's a stable state. If the hang checker
> >consistency pre-check determines that we've got a sustained CSSC
> >inconsistency then that's stable too. The states are not changing so
> >whatever we do will not be because we detect the state in the middle of
> >a state transition and the detection won't be subject to concurrency
> >effects. If the hang checker decides that the inconsistency needs to be
> >rectified and fakes the presumably lost interrupt and the real, presumed
> >lost, interrupt happens to come in at the same time then that's fine,
> >the CSB buffer check in the execlist interrupt handler is made to cope
> >with that. We can have X number of calls to the interrupt handler or
> >just one, the outcome is supposed to be the same - the only thing that
> >matters is captured context state changes in the CSB buffer that we act
> >upon.
> >
> >So I'm not entirely sure what concurrency issues might be reason enough
> >to move out the CSSC to the hang recovery path. In fact, I'd be more
> >inclined to create a second async task for it to make sure it's being
> >run at all times. But in that case we might as well let it stay in the
> >hang checker.
> >
> >(In a similar vein, I think we should move the missed
> >>   interupt handler for legacy out of hangcheck, partly to simplify some
> >>   very confusing code and partly so that we have fewer deviations
> >>   between legacy/execlists paths.) It also gets us thinking about the
> >>   consistency detection and when it is viable to do a fake-interrupt and
> >>   when we must do a full-reset (for example, we only want to
> >>   fake-interrupt if the consistency check says the GPU is idle, and we
> >>   definitely want to reset everything if the GPU is executing an alien
> >>   context.)
> >>
> >>   A test here would be to suspend the execlists irq and wait for the
> >>   recovery. Cheekily we could punch the irq eir by root mmio and check
> >>   hangcheck automagically recovers.
> >>
> >>Whilst it would be nice to add the watchdog next, since it is
> >>conceptually quite simple and basically just a new hangcheck source with
> >>fancy setup - fast hangcheck without soft reset makes for an easy DoS.
> >>
> >>2. TDR
> >>
> >>   Given that we have a consistency check and begun to extend the reset
> >>   path, we can implement a soft reset that only skips the hung request.
> >>   (The devil is in the details, whilst the design here looked solid, I
> >>   think the LRC recovery code could be simplified - I didn't feel
> >>   another requeue path was required given that we need only pretend a
> >>   request completed (fixing up the context image as required) and then
> >>   use the normal unqueue.) There is also quite a bit of LRC cleanup on
> >>   the lists which would be useful here.
> >
> >As far as the new requeue (or resubmission) path is concerned, you might
> >have a point here. The reason it's as involved as it is is probably
> >mostly because of all the validation that takes place in the
> >resubmission path. Meaning that once the resubmission happens at the end
> >of the per-engine hang recovery path we want to make extra sure that the
> >context that gets resubmitted in the end (the head element of the queue
> >at that point in time) is in fact the one that was passed down from the
> >per-engine hang recovery path (the context at the head of the queue at
> >the start of the hang recovery path), so that the state of the queue
> >didn't change during hang recovery. Maybe we're too paranoid here.
> >
> 
> Ah, yes, there is one crucial difference between the normal
> execlists_context_unqueue() function and
> execlists_TDR_context_unqueue() that means that we cannot fully
> reuse the former one for TDR-specific purposes.
> 
> When we resubmit the context via TDR it's important that we do not
> increment the elsp_submitted counter and otherwise treat the
> resubmission as a normal submission. The reason for this is that the
> hardware consistently refuses to send out any kind of interrupt
> acknowledging the context resubmission. If you just submit the
> context and increment elsp_submitted for that context, like you
> normally do, the interrupt handler will sit around forever waiting
> for the interrupt that will never come for the resubmission. It will
> wait - and receive - the interrupt for the original submission that
> caused the original hang and also the interrupt for the context
> completion following hang recovery. But it won't receive an
> interrupt for the resubmission.

But that's just an issue with the current poor design of execlists...

If we tracked exactly what request we submitted to each port, and a
unique ctx id tag for every submission (incl subsumed ones for easier
completion handling) you completely eliminate the need for elsp_submitted.
And a TDR resubmit is just that. (And also the code is much smaller and
much simpler.)

So I don't think you have sufficiently reworked execlists to avoid
unnecessary duplication.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx