cpuidle: Kernel panics with AMD Opteron 6300 entering C2

* cpuidle: Kernel panics with AMD Opteron 6300 entering C2 - clock related
@ 2015-06-17 10:06 Sebastian Parschauer
  2015-06-18  9:22 ` Daniel Lezcano
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Parschauer @ 2015-06-17 10:06 UTC
  To: Rafael J. Wysocki, Daniel Lezcano; +Cc: linux-pm

Hi cpuidle maintainers,

we notice kernel panics with CPUs from the AMD Opteron 6300 series and
kernel 3.12 when entering C2. In that C-state the clock is shut down but
the flag CPUIDLE_FLAG_TIMER_STOP isn't set. We use the TSC clock source
for performance as our servers host KVM VMs. During the panics
interrupts are enabled again and the timer interrupt corrupts the
instruction pointer and/or the stack pointer.

Would it help to set the flag CPUIDLE_FLAG_TIMER_STOP for C2?
Or how to fix this?

Thanks,
Sebastian

==========
Additional debug info:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<          (null)>]           (null)
...
Call trace:
[<ffffffff815af9b5>] cpuidle_idle_call+0xc5/0x150
[<ffffffff8100b529>] arch_cpu_idle+0x9/0x20
[<ffffffff81092e6f>] cpu_startup_entry+0xaf/0x240
[<ffffffff8102df4b>] start_secondary+0x1db/0x240

The CPUs provide three C-states:
0: POLL
1: C1
2: C2

C2 information from the crash dump:

> {
>       name = "C2\000\000\000\000\000\000\000\000\000\000\000\000\000", 
>       desc = "ACPI IOPORT 0x815\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
>       flags = 1, 
>       exit_latency = 100, 
>       power_usage = 0, 
>       target_residency = 200, 
>       disabled = false, 
>       enter = 0xffffffffa00ab026 <acpi_idle_enter_simple>, 
>       enter_dead = 0xffffffffa00aa39c <acpi_idle_play_dead>
> }
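As a cross-check of the statement that CPUIDLE_FLAG_TIMER_STOP isn't set, the "flags = 1" field above can be decoded against the idle-state flag bits. This is only a sketch: the bit values are an assumption taken from include/linux/cpuidle.h as found in 3.12-era kernels and should be verified against the exact tree in use, and the helper name decode_cpuidle_flags is made up for illustration.

```python
# Assumed flag bits from include/linux/cpuidle.h in 3.12-era kernels.
# Verify them against the exact kernel tree before relying on this.
CPUIDLE_FLAGS = {
    0x01: "CPUIDLE_FLAG_TIME_VALID",
    0x02: "CPUIDLE_FLAG_COUPLED",
    0x04: "CPUIDLE_FLAG_TIMER_STOP",
}


def decode_cpuidle_flags(flags):
    """Return the names of all known flag bits that are set in `flags`."""
    return [name for bit, name in sorted(CPUIDLE_FLAGS.items()) if flags & bit]


# "flags = 1" is the value from the C2 entry in the crash dump.
c2_flags = decode_cpuidle_flags(1)
print(c2_flags)
```

Under these assumed bit values only the lowest bit is set for C2, which would be consistent with the timer-stop flag being absent.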

Assembly level analysis:

> RDX: 0000000225c17d03

So the upper 32 bits of RDX are 00000002 and that's the entered state C2 (EDX itself, the lower 32 bits, is 25c17d03).
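For reference, the dumped RDX value can be split into its two 32-bit halves with a few lines of arithmetic. EDX is the architectural name of the lower half of RDX, so the value 2 sits in the upper half. The variable names below are chosen for this sketch only.

```python
# Register value copied from the crash dump.
RDX = 0x0000000225C17D03

upper_32 = RDX >> 32       # bits 63..32
edx = RDX & 0xFFFFFFFF     # bits 31..0, which is what EDX names

print(f"upper 32 bits: {upper_32:#010x}")
print(f"EDX (lower):   {edx:#010x}")
```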

> RDI: ffffffff81c15540
> ..
> crash> info symbol 0xffffffff81c15540
> clocksource_tsc in section .data
> 
> crash> disassemble cpuidle_enter_state
> ...
>    0xffffffff815af5fc <+60>:    callq  0xffffffff8109b360 <ktime_get>
>    0xffffffff815af601 <+65>:    sti    
>    0xffffffff815af602 <+66>:    sub    %r13,%rax <- here rdi still points to clocksource_tsc
>    0xffffffff815af605 <+69>:    mov    %rax,%rdi <- rdi is overwritten with the ktime_get return value minus r13


* Re: cpuidle: Kernel panics with AMD Opteron 6300 entering C2 - clock related
  2015-06-17 10:06 cpuidle: Kernel panics with AMD Opteron 6300 entering C2 - clock related Sebastian Parschauer
@ 2015-06-18  9:22 ` Daniel Lezcano
  2015-06-18 10:52   ` Sebastian Parschauer
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Lezcano @ 2015-06-18  9:22 UTC
  To: Sebastian Parschauer, Rafael J. Wysocki; +Cc: linux-pm

On 06/17/2015 12:06 PM, Sebastian Parschauer wrote:
> Hi cpuidle maintainers,
>
> we notice kernel panics with CPUs from the AMD Opteron 6300 series and
> kernel 3.12 when entering C2. In that C-state the clock is shut down but
> the flag CPUIDLE_FLAG_TIMER_STOP isn't set. We use the TSC clock source
> for performance as our servers host KVM VMs. During the panics
> interrupts are enabled again and the timer interrupt corrupts the
> instruction pointer and/or the stack pointer.
>
> Would it help to set the flag CPUIDLE_FLAG_TIMER_STOP for C2?
> Or how to fix this?

Did you try the flag? Does it fix it?



-- 
  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog



* Re: cpuidle: Kernel panics with AMD Opteron 6300 entering C2 - clock related
  2015-06-18  9:22 ` Daniel Lezcano
@ 2015-06-18 10:52   ` Sebastian Parschauer
  2015-06-18 11:21     ` Sebastian Parschauer
  2015-06-18 13:23     ` Daniel Lezcano
  0 siblings, 2 replies; 7+ messages in thread
From: Sebastian Parschauer @ 2015-06-18 10:52 UTC
  To: Daniel Lezcano, Rafael J. Wysocki; +Cc: linux-pm, Sebastian Parschauer

On 18.06.2015 11:22, Daniel Lezcano wrote:
> Did you try the flag? Does it fix it?

Thanks for your reply. Unfortunately we can't roll out new kernels fast
(VMs have to be migrated). But we've disabled the C2 state via sysfs for
all CPU cores and all servers and had one more kernel panic with the
same call trace although C2 was (or should have been) disabled. We use
the menu governor and a v3.12.40 kernel.

It seems strange to me that we end up in the same code path with state
index 2 as the parameter again. I think I'll prepare a kernel with some debug
messages when transitioning from one state to another and deploy it to a
test system.
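
Short of building a debug kernel, one way to watch idle-state entries from user space is to sample the per-state "usage" counters that cpuidle exposes in sysfs. The sketch below assumes the usual layout /sys/devices/system/cpu/cpuN/cpuidle/stateM/usage; the function names read_usage and entered_since are made up for illustration. These counters only reflect what the kernel itself requested, so a promotion done by the firmware would not show up here.

```python
import glob
import os


def read_usage(root="/sys/devices/system/cpu", state=2):
    """Map each CPU directory name to the entry count of the given idle state."""
    counts = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpuidle", f"state{state}", "usage")
    for path in glob.glob(pattern):
        # path looks like <root>/cpuN/cpuidle/stateM/usage
        cpu = path.split(os.sep)[-4]
        with open(path) as f:
            counts[cpu] = int(f.read().strip())
    return counts


def entered_since(before, after):
    """Return the CPUs whose counter grew between two snapshots, with the delta."""
    return {cpu: after[cpu] - n for cpu, n in before.items()
            if cpu in after and after[cpu] > n}
```

Taking one snapshot with read_usage(), waiting a while, taking a second one and passing both to entered_since() shows which cores still entered state index 2 in between.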

Is there any better method to debug the cpuidle driver?

How do you guys test it?

Can we provide any missing additional information?

Maybe something else corrupts the memory in an interrupt and the cpuidle
driver is just the one noticing an unrelated problem.




* Re: cpuidle: Kernel panics with AMD Opteron 6300 entering C2 - clock related
  2015-06-18 10:52   ` Sebastian Parschauer
@ 2015-06-18 11:21     ` Sebastian Parschauer
  2015-06-18 13:28       ` Daniel Lezcano
  2015-06-18 13:23     ` Daniel Lezcano
  1 sibling, 1 reply; 7+ messages in thread
From: Sebastian Parschauer @ 2015-06-18 11:21 UTC
  To: Daniel Lezcano, Rafael J. Wysocki; +Cc: linux-pm

On 18.06.2015 12:52, Sebastian Parschauer wrote:
> Thanks for your reply. Unfortunately we can't roll out new kernels fast
> (VMs have to be migrated). But we've disabled the C2 state via sysfs for
> all CPU cores and all servers and had one more kernel panic with the
> same call trace although C2 was (or should have been) disabled. We use
> the menu governor and a v3.12.40 kernel.
> 
> It seems strange to me that we end up in the same code path with state
> index 2 as the parameter again. I think I'll prepare a kernel with some debug
> messages when transitioning from one state to another and deploy it to a
> test system.
> 
> Is there any better method to debug the cpuidle driver?
> 
> How do you guys test it?
> 
> Can we provide any missing additional information?
> 
> Maybe something else corrupts the memory in an interrupt and the cpuidle
> driver is just the one noticing an unrelated problem.

Sorry, I had a closer look at the most recent crash again. It happened
when entering C1, with C2 disabled. So maybe our problem is not
cpuidle-related.




* Re: cpuidle: Kernel panics with AMD Opteron 6300 entering C2 - clock related
  2015-06-18 10:52   ` Sebastian Parschauer
  2015-06-18 11:21     ` Sebastian Parschauer
@ 2015-06-18 13:23     ` Daniel Lezcano
  1 sibling, 0 replies; 7+ messages in thread
From: Daniel Lezcano @ 2015-06-18 13:23 UTC
  To: Sebastian Parschauer, Rafael J. Wysocki; +Cc: linux-pm

On 06/18/2015 12:52 PM, Sebastian Parschauer wrote:
> Thanks for your reply. Unfortunately we can't roll out new kernels fast
> (VMs have to be migrated). But we've disabled the C2 state via sysfs for
> all CPU cores and all servers and had one more kernel panic with the
> same call trace although C2 was (or should have been) disabled. We use
> the menu governor and a v3.12.40 kernel.
>
> It seems strange to me that we end up in the same code path with state
> index 2 as the parameter again. I think I'll prepare a kernel with some debug
> messages when transitioning from one state to another and deploy it to a
> test system.

It is weird that you disabled state index 2 and the system still enters
with this index.

Are you sure you effectively disabled this state for all cores on the
system?
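
One way to double-check this from user space is to walk every core's cpuidle directory and report where the state is still enabled. This is a minimal sketch, assuming the standard sysfs layout with a "name" and a "disable" file under /sys/devices/system/cpu/cpuN/cpuidle/stateM/; the function name cores_with_state_enabled is hypothetical.

```python
import glob
import os


def cores_with_state_enabled(state_name="C2", root="/sys/devices/system/cpu"):
    """List (cpu, state directory) pairs where the named state is not disabled."""
    enabled = []
    pattern = os.path.join(root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in sorted(glob.glob(pattern)):
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        with open(os.path.join(state_dir, "disable")) as f:
            disabled = f.read().strip() != "0"
        if name == state_name and not disabled:
            # state_dir looks like <root>/cpuN/cpuidle/stateM
            parts = state_dir.split(os.sep)
            enabled.append((parts[-3], parts[-1]))
    return enabled
```

An empty list for cores_with_state_enabled("C2") would indicate that the kernel-side disable really took effect on every core. It says nothing about what the firmware does on its own.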

Furthermore, if I am not wrong, C-state semantics on AMD differ a bit
from Intel's.

The firmware will put the cluster down if all cores go to the C1 state, no?

> Is there any better method to debug the cpuidle driver?

You can try passing this on the kernel command line:

processor.max_cstate=1

Note that this does not guarantee the firmware won't promote to a deeper
idle state.
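
To confirm after a reboot that the parameter actually reached the running kernel, the boot command line can be checked from user space. The sketch below parses a command-line string such as the contents of /proc/cmdline; the function name max_cstate_from_cmdline is made up, and the choice to let a later occurrence override an earlier one is an assumption about how repeated parameters are handled.

```python
def max_cstate_from_cmdline(cmdline):
    """Return the processor.max_cstate value in a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "processor.max_cstate" and sep and val.isdigit():
            # Assumption: if the parameter is repeated, the last one wins.
            value = int(val)
    return value
```

On a live system this would be called as max_cstate_from_cmdline(open("/proc/cmdline").read()); a result of None means the limit was not passed at boot.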

If the kernel panics again, maybe there is an option in the BIOS to set
the maximum idle state for the firmware.

> How do you guys test it?

On the x86 platform, most of the magic is in the firmware, so if there 
is a bug there, hmm ... that will be hard to spot.




* Re: cpuidle: Kernel panics with AMD Opteron 6300 entering C2 - clock related
  2015-06-18 11:21     ` Sebastian Parschauer
@ 2015-06-18 13:28       ` Daniel Lezcano
  2015-06-18 14:09         ` Sebastian Parschauer
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Lezcano @ 2015-06-18 13:28 UTC
  To: Sebastian Parschauer, Rafael J. Wysocki; +Cc: linux-pm

On 06/18/2015 01:21 PM, Sebastian Parschauer wrote:
> Sorry, I had a closer look at the most recent crash again. It happened
> when entering C1, with C2 disabled. So maybe our problem is not
> cpuidle-related.

As mentioned in the previous email, disabling idle state index 2 in the
kernel does not prevent the firmware from auto-promoting to this state.

By the way, I am not sure this is really the C2 state rather than just idle
state index 2. Could you give the C-state names you have in the sysfs
directory?



* Re: cpuidle: Kernel panics with AMD Opteron 6300 entering C2 - clock related
  2015-06-18 13:28       ` Daniel Lezcano
@ 2015-06-18 14:09         ` Sebastian Parschauer
  0 siblings, 0 replies; 7+ messages in thread
From: Sebastian Parschauer @ 2015-06-18 14:09 UTC
  To: Daniel Lezcano, Rafael J. Wysocki; +Cc: linux-pm

On 18.06.2015 15:28, Daniel Lezcano wrote:
> As mentioned in the previous email, disabling idle state index 2 in the
> kernel does not prevent the firmware from auto-promoting to this state.
>
> By the way, I am not sure this is really the C2 state rather than just idle
> state index 2. Could you give the C-state names you have in the sysfs
> directory?

state0: POLL
state1: C1
state2: C2

We also see in the crash dump that C2 is drv->states[2].

Thanks for the detailed information! That helps a lot. Yes, we have the
possibility to set the max. allowed C-state in the BIOS. We'll look into
this. We are only rarely in C2 and can't afford to disable C1. So setting
C1 as the maximum C-state in the BIOS could be an option.
Thanks!

Cheers,
Sebastian
