From: Andreas Hollmann <hollmann@in.tum.de>
To: linux-numa <linux-numa@vger.kernel.org>
Subject: Unbalanced CPU utilization on NUMA 4-Socket machine, workload evenly spread on machine.
Date: Tue, 7 Jan 2014 09:56:58 +0100 [thread overview]
Message-ID: <CAGz0_-1KkNTW_3a7-X4+Y_HjwpwZsnKqhrjayzQ8C82aB=X8sA@mail.gmail.com> (raw)
Hi,
I discovered uneven CPU utilization on my NUMA machine.
The workload is the STREAM benchmark, which is evenly
distributed across the machine and initialized in parallel.
The machine is a 4-socket Westmere-EX with 10 cores per
node and 80 logical CPUs in total.
It runs well for about 5-10 seconds, then the performance
drops: htop shows the CPUs on socket 0 at 100 % utilization
while sockets 1-3 are only at about 50 %. I'm using
likwid-pin, which does strict thread-to-core pinning: the first
thread goes to the first core in the list, and so on.
I build and run the STREAM benchmark like this:
wget www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -fopenmp -mcmodel=medium -O -DSTREAM_ARRAY_SIZE=1000000000 \
    -DNTIMES=1000 stream.c -o stream.1000M.1000
likwid-pin -c 0-39 ./stream.1000M.1000
The performance also drops without pinning, but then it's not clear which
thread is running on which core.
What could cause such drops, and how can I detect the cause?
- thermal throttling? (checked; everything seems OK)
- power capping? (is this implemented on Westmere-EX?)
- bad memory? (but why does it run well in the beginning?)
Here is also the output of Intel's Performance Counter Monitor (PCM) tool.
It looks like this in the beginning (output of /opt/PCM/pcm.x 2 -nc;
these values cover a 2-second interval):
EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock
ticks'/'invariant timer ticks'
(includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state
(not in power-saving C state)='unhalted clock
ticks'/'invariant timer ticks while in C0-state
(includes Intel Turbo Boost)
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax
temperature (thermal headroom):
0 corresponds to the max temperature
Core (SKT) | EXEC | IPC | FREQ | AFREQ | READ | WRITE | TEMP
----------------------------------------------------------------
SKT 0 0.13 0.26 0.50 1.00 27.20 11.52 N/A
SKT 1 0.13 0.26 0.50 1.00 27.14 11.51 N/A
SKT 2 0.13 0.27 0.50 1.00 27.13 11.51 N/A
SKT 3 0.13 0.27 0.50 1.00 27.12 11.51 N/A
----------------------------------------------------------------
TOTAL * 0.13 0.27 0.50 1.00 108.59 46.05 N/A
Instructions retired: 42 G ; Active cycles: 160 G ; Time (TSC): 4020 Mticks
C0 (active,non-halted) core residency: 49.95 %
C1 core residency: 49.64 %; C3 core residency: 0.03 %; C6 core
residency: 0.37 %;
C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7
package residency: 0.00 %;
PHYSICAL CORE IPC : 0.53 => corresponds to 13.26 %
utilization for cores in active state
Instructions per nominal CPU cycle: 0.26 => corresponds to 6.62 %
core utilization over time interval
--- And then it drops to these values ---
Core (SKT) | EXEC | IPC | FREQ | AFREQ | READ | WRITE | TEMP
----------------------------------------------------------------
SKT 0 0.06 0.12 0.50 1.00 13.82 5.28 N/A
SKT 1 0.06 0.25 0.23 1.00 12.68 4.97 N/A
SKT 2 0.06 0.25 0.23 1.00 12.67 4.97 N/A
SKT 3 0.06 0.25 0.23 1.00 12.67 4.96 N/A
----------------------------------------------------------------
TOTAL * 0.06 0.20 0.30 1.00 51.84 20.17 N/A
Instructions retired: 19 G ; Active cycles: 96 G ; Time (TSC):
4021 Mticks ;
C0 (active,non-halted) core residency: 29.84 %
C1 core residency: 53.53 %; C3 core residency: 0.00 %; C6 core
residency: 16.62 %;
C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7
package residency: 0.00 %;
PHYSICAL CORE IPC : 0.40 => corresponds to 9.91 %
utilization for cores in active state
Instructions per nominal CPU cycle: 0.12 => corresponds to 2.96 %
core utilization over time interval
----
These lines are interesting: SKT 0 has an IPC of 0.12 while SKT 1 has 0.25.
SKT 0: FREQ is 0.50, as expected, since half of the logical CPUs are idle
because of hyperthreading (40 threads on 80 logical CPUs).
SKT 1: FREQ is only 0.23, which means that even the "active" threads are
halted more than half of the time (0.23 instead of the expected 0.50). Why?
Core (SKT) | EXEC | IPC | FREQ | AFREQ | READ | WRITE | TEMP
SKT 0 0.06 0.12 0.50 1.00 13.82 5.28 N/A
SKT 1 0.06 0.25 0.23 1.00 12.68 4.97 N/A
Perf shows similar results: performance and bandwidth drop by about 50 %.
perf stat --per-socket --interval-print 2000 -a -e \
    "uncore_mbox_0/event=bbox_cmds_read/","uncore_mbox_1/event=bbox_cmds_write/" \
    sleep 3600
Help or suggestions would be much appreciated.
Thanks and best regards,
Andreas