Re: [clocksource] 388450c708: netperf.Throughput_tps -65.1% regression

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

From: Feng Tang <feng.tang@intel.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: kernel test robot <oliver.sang@intel.com>,
	0day robot <lkp@intel.com>, Thomas Gleixner <tglx@linutronix.de>,
	John Stultz <john.stultz@linaro.org>,
	Stephen Boyd <sboyd@kernel.org>, Jonathan Corbet <corbet@lwn.net>,
	Mark Rutland <Mark.Rutland@arm.com>,
	Marc Zyngier <maz@kernel.org>, Andi Kleen <ak@linux.intel.com>,
	Xing Zhengjun <zhengjun.xing@linux.intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, ying.huang@intel.com, zhengjun.xing@intel.com,
	kernel-team@fb.com, neeraju@codeaurora.org
Subject: Re: [clocksource]  388450c708:  netperf.Throughput_tps -65.1% regression
Date: Sun, 16 May 2021 14:34:19 +0800	[thread overview]
Message-ID: <20210516063419.GA22111@shbuild999.sh.intel.com> (raw)
In-Reply-To: <20210514174908.GI975577@paulmck-ThinkPad-P17-Gen-1>

On Fri, May 14, 2021 at 10:49:08AM -0700, Paul E. McKenney wrote:
> On Fri, May 14, 2021 at 03:43:14PM +0800, Feng Tang wrote:
> > Hi Paul,
> > 
> > On Thu, May 13, 2021 at 10:07:07AM -0700, Paul E. McKenney wrote:
> > > On Thu, May 13, 2021 at 11:55:15PM +0800, kernel test robot wrote:
> > > > 
> > > > 
> > > > Greeting,
> > > > 
> > > > FYI, we noticed a -65.1% regression of netperf.Throughput_tps due to commit:
> > > > 
> > > > 
> > > > commit: 388450c7081ded73432e2b7148c1bb9a0b039963 ("[PATCH v12 clocksource 4/5] clocksource: Reduce clocksource-skew threshold for TSC")
> > > > url: https://github.com/0day-ci/linux/commits/Paul-E-McKenney/Do-not-mark-clocks-unstable-due-to-delays-for-v5-13/20210501-083404
> > > > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2d036dfa5f10df9782f5278fc591d79d283c1fad
> > > > 
> > > > in testcase: netperf
> > > > on test machine: 96 threads 2 sockets Ice Lake with 256G memory
> > > > with following parameters:
> > > > 
> > > > 	ip: ipv4
> > > > 	runtime: 300s
> > > > 	nr_threads: 25%
> > > > 	cluster: cs-localhost
> > > > 	test: UDP_RR
> > > > 	cpufreq_governor: performance
> > > > 	ucode: 0xb000280
> > > > 
> > > > test-description: Netperf is a benchmark that can be use to measure various aspect of networking performance.
> > > > test-url: http://www.netperf.org/netperf/
> > > > 
> > > > 
> > > > 
> > > > If you fix the issue, kindly add following tag
> > > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > > 
> > > > 
> > > > also as Feng Tang checked, this is a "unstable clocksource" case.
> > > > attached dmesg FYI.
> > > 
> > > Agreed, given the clock-skew event and the resulting switch to HPET,
> > > performance regressions are expected behavior.
> > > 
> > > That dmesg output does demonstrate the value of Feng Tang's patch!
> > > 
> > > I don't see how to obtain the values of ->mult and ->shift that would
> > > allow me to compute the delta.  So if you don't tell me otherwise, I
> > > will assume that the skew itself was expected on this hardware, perhaps
> > > somehow due to the tpm_tis_status warning immediately preceding the
> > > clock-skew event.  If my assumption is incorrect, please let me know.
> > 
> > I run the case with the debug patch applied, the info is:
> > 
> > [   13.796429] clocksource: timekeeping watchdog on CPU19: Marking clocksource 'tsc' as unstable because the skew is too large:
> > [   13.797413] clocksource:                       'hpet' wd_nesc: 505192062 wd_now: 10657158 wd_last: fac6f97 mask: ffffffff
> > [   13.797413] clocksource:                       'tsc' cs_nsec: 504008008 cs_now: 3445570292aa5 cs_last: 344551f0cad6f mask: ffffffffffffffff
> > [   13.797413] clocksource:                       'tsc' is current clocksource.
> > [   13.797413] tsc: Marking TSC unstable due to clocksource watchdog
> > [   13.844513] clocksource: Checking clocksource tsc synchronization from CPU 50 to CPUs 0-1,12,22,32-33,60,65.
> > [   13.855080] clocksource: Switched to clocksource hpet
> > 
> > So the delta is 1184 us (505192062 - 504008008), and I agree with
> > you that it should be related with the tpm_tis_status warning stuff.
> > 
> > But this re-trigger my old concerns, that if the margins calculated
> > for tsc, hpet are too small?
> 
> If the error really did disturb either tsc or hpet, then we really
> do not have a false positive, and nothing should change (aside from
> perhaps documenting that TPM issues can disturb the clocks, or better
> yet treating that perturbation as a separate bug that should be fixed).
> But if this is yet another way to get a confused measurement, then it
> would be better to work out a way to reject the confusion and keep the
> tighter margins.  I cannot think right off of a way that this could
> cause measurement confusion, but you never know.

I have no doubt in the correctness of the measuring method, but was
just afraid some platforms which use to 'just work' will be caught :)

> So any thoughts on exactly how the tpm_tis_status warning might have
> resulted in the skew?

The tpm error message has been reported before, and from google there
were some similar errors, we'll do some further check.

> > With current math algorithm, the 'uncertainty_margin' is
> > calculated against the frequency, and those tsc/hpet/acpi_pm
> > timer is multiple of MHz or GHz, which gives them to have margin of
> > 100 us. It works with normal systems. But in the wild world, there
> > could be some sparkles due to some immature HW components, their
> > firmwares or drivers etc, just like this case. 
> 
> Isn't diagnosing issues from immature hardware, firmware, and drivers
> actually a benefit?  It would after all be quite unfortunate if some issue
> that was visible only due to clock skew were to escape into production.

Yes, it surely will expose some mal-functional cases which haven't
been caught before.

Thanks,
Feng


> 							Thanx, Paul

WARNING: multiple messages have this Message-ID (diff)

From: Feng Tang <feng.tang@intel.com>
To: lkp@lists.01.org
Subject: Re: [clocksource] 388450c708: netperf.Throughput_tps -65.1% regression
Date: Sun, 16 May 2021 14:34:19 +0800	[thread overview]
Message-ID: <20210516063419.GA22111@shbuild999.sh.intel.com> (raw)
In-Reply-To: <20210514174908.GI975577@paulmck-ThinkPad-P17-Gen-1>

[-- Attachment #1: Type: text/plain, Size: 5043 bytes --]

On Fri, May 14, 2021 at 10:49:08AM -0700, Paul E. McKenney wrote:
> On Fri, May 14, 2021 at 03:43:14PM +0800, Feng Tang wrote:
> > Hi Paul,
> > 
> > On Thu, May 13, 2021 at 10:07:07AM -0700, Paul E. McKenney wrote:
> > > On Thu, May 13, 2021 at 11:55:15PM +0800, kernel test robot wrote:
> > > > 
> > > > 
> > > > Greeting,
> > > > 
> > > > FYI, we noticed a -65.1% regression of netperf.Throughput_tps due to commit:
> > > > 
> > > > 
> > > > commit: 388450c7081ded73432e2b7148c1bb9a0b039963 ("[PATCH v12 clocksource 4/5] clocksource: Reduce clocksource-skew threshold for TSC")
> > > > url: https://github.com/0day-ci/linux/commits/Paul-E-McKenney/Do-not-mark-clocks-unstable-due-to-delays-for-v5-13/20210501-083404
> > > > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2d036dfa5f10df9782f5278fc591d79d283c1fad
> > > > 
> > > > in testcase: netperf
> > > > on test machine: 96 threads 2 sockets Ice Lake with 256G memory
> > > > with following parameters:
> > > > 
> > > > 	ip: ipv4
> > > > 	runtime: 300s
> > > > 	nr_threads: 25%
> > > > 	cluster: cs-localhost
> > > > 	test: UDP_RR
> > > > 	cpufreq_governor: performance
> > > > 	ucode: 0xb000280
> > > > 
> > > > test-description: Netperf is a benchmark that can be use to measure various aspect of networking performance.
> > > > test-url: http://www.netperf.org/netperf/
> > > > 
> > > > 
> > > > 
> > > > If you fix the issue, kindly add following tag
> > > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > > 
> > > > 
> > > > also as Feng Tang checked, this is a "unstable clocksource" case.
> > > > attached dmesg FYI.
> > > 
> > > Agreed, given the clock-skew event and the resulting switch to HPET,
> > > performance regressions are expected behavior.
> > > 
> > > That dmesg output does demonstrate the value of Feng Tang's patch!
> > > 
> > > I don't see how to obtain the values of ->mult and ->shift that would
> > > allow me to compute the delta.  So if you don't tell me otherwise, I
> > > will assume that the skew itself was expected on this hardware, perhaps
> > > somehow due to the tpm_tis_status warning immediately preceding the
> > > clock-skew event.  If my assumption is incorrect, please let me know.
> > 
> > I run the case with the debug patch applied, the info is:
> > 
> > [   13.796429] clocksource: timekeeping watchdog on CPU19: Marking clocksource 'tsc' as unstable because the skew is too large:
> > [   13.797413] clocksource:                       'hpet' wd_nesc: 505192062 wd_now: 10657158 wd_last: fac6f97 mask: ffffffff
> > [   13.797413] clocksource:                       'tsc' cs_nsec: 504008008 cs_now: 3445570292aa5 cs_last: 344551f0cad6f mask: ffffffffffffffff
> > [   13.797413] clocksource:                       'tsc' is current clocksource.
> > [   13.797413] tsc: Marking TSC unstable due to clocksource watchdog
> > [   13.844513] clocksource: Checking clocksource tsc synchronization from CPU 50 to CPUs 0-1,12,22,32-33,60,65.
> > [   13.855080] clocksource: Switched to clocksource hpet
> > 
> > So the delta is 1184 us (505192062 - 504008008), and I agree with
> > you that it should be related with the tpm_tis_status warning stuff.
> > 
> > But this re-trigger my old concerns, that if the margins calculated
> > for tsc, hpet are too small?
> 
> If the error really did disturb either tsc or hpet, then we really
> do not have a false positive, and nothing should change (aside from
> perhaps documenting that TPM issues can disturb the clocks, or better
> yet treating that perturbation as a separate bug that should be fixed).
> But if this is yet another way to get a confused measurement, then it
> would be better to work out a way to reject the confusion and keep the
> tighter margins.  I cannot think right off of a way that this could
> cause measurement confusion, but you never know.

I have no doubt in the correctness of the measuring method, but was
just afraid some platforms which use to 'just work' will be caught :)

> So any thoughts on exactly how the tpm_tis_status warning might have
> resulted in the skew?

The tpm error message has been reported before, and from google there
were some similar errors, we'll do some further check.

> > With current math algorithm, the 'uncertainty_margin' is
> > calculated against the frequency, and those tsc/hpet/acpi_pm
> > timer is multiple of MHz or GHz, which gives them to have margin of
> > 100 us. It works with normal systems. But in the wild world, there
> > could be some sparkles due to some immature HW components, their
> > firmwares or drivers etc, just like this case. 
> 
> Isn't diagnosing issues from immature hardware, firmware, and drivers
> actually a benefit?  It would after all be quite unfortunate if some issue
> that was visible only due to clock skew were to escape into production.

Yes, it surely will expose some mal-functional cases which haven't
been caught before.

Thanks,
Feng


> 							Thanx, Paul

next prev parent reply	other threads:[~2021-05-16  6:34 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-01  0:32 [PATCH v12 clocksource 0/5] Do not mark clocks unstable due to delays for v5.13 Paul E. McKenney
2021-05-01  0:32 ` [PATCH v12 clocksource 1/5] clocksource: Retry clock read if long delays detected Paul E. McKenney
2021-05-01  0:32 ` [PATCH v12 clocksource 2/5] clocksource: Check per-CPU clock synchronization when marked unstable Paul E. McKenney
2021-05-01  0:32 ` [PATCH v12 clocksource 3/5] clocksource: Limit number of CPUs checked for clock synchronization Paul E. McKenney
2021-05-01  0:32 ` [PATCH v12 clocksource 4/5] clocksource: Reduce clocksource-skew threshold for TSC Paul E. McKenney
2021-05-13 15:55   ` [clocksource] 388450c708: netperf.Throughput_tps -65.1% regression kernel test robot
2021-05-13 15:55     ` kernel test robot
2021-05-13 17:07     ` Paul E. McKenney
2021-05-13 17:07       ` Paul E. McKenney
2021-05-14  7:43       ` Feng Tang
2021-05-14  7:43         ` Feng Tang
2021-05-14 17:49         ` Paul E. McKenney
2021-05-14 17:49           ` Paul E. McKenney
2021-05-16  6:34           ` Feng Tang [this message]
2021-05-16  6:34             ` Feng Tang
2021-05-17  2:30             ` Paul E. McKenney
2021-05-17  2:30               ` Paul E. McKenney
2021-05-19  6:09             ` Feng Tang
2021-05-19  6:09               ` Feng Tang
2021-05-19 18:05               ` Paul E. McKenney
2021-05-19 18:05                 ` Paul E. McKenney
2021-05-20  0:52                 ` Feng Tang
2021-05-20  0:52                   ` Feng Tang
2021-05-20  4:55                   ` Paul E. McKenney
2021-05-20  4:55                     ` Paul E. McKenney
2021-05-01  0:32 ` [PATCH v12 clocksource 5/5] clocksource: Provide kernel module to test clocksource watchdog Paul E. McKenney
2021-05-01  9:49   ` kernel test robot
2021-05-01  9:49     ` kernel test robot
2021-05-01 16:01     ` Paul E. McKenney
2021-05-01 16:01       ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210516063419.GA22111@shbuild999.sh.intel.com \
    --to=feng.tang@intel.com \
    --cc=Mark.Rutland@arm.com \
    --cc=ak@linux.intel.com \
    --cc=corbet@lwn.net \
    --cc=john.stultz@linaro.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkp@intel.com \
    --cc=lkp@lists.01.org \
    --cc=maz@kernel.org \
    --cc=neeraju@codeaurora.org \
    --cc=oliver.sang@intel.com \
    --cc=paulmck@kernel.org \
    --cc=sboyd@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=ying.huang@intel.com \
    --cc=zhengjun.xing@intel.com \
    --cc=zhengjun.xing@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.