From: Andreas Hollmann <hollmann@in.tum.de>
To: linux-numa <linux-numa@vger.kernel.org>
Subject: Unbalanced CPU utilization on NUMA 4-Socket machine, workload evenly spread on machine.
Date: Tue, 7 Jan 2014 09:56:58 +0100 [thread overview]
Message-ID: <CAGz0_-1KkNTW_3a7-X4+Y_HjwpwZsnKqhrjayzQ8C82aB=X8sA@mail.gmail.com> (raw)
Hi,
I discovered uneven CPU utilization on my NUMA machine.
The workload is the STREAM benchmark, which is evenly
distributed across the machine and initialized in parallel.
The machine is a 4-socket Westmere-EX with 10 cores per
node and 80 logical CPUs in total.
It runs well for about 5-10 seconds, then the performance
drops: htop shows the CPUs on socket 0 at 100 % utilization
while sockets 1-3 are only at about 50 %. I'm using
likwid-pin, which does strict thread-to-core pinning: the first
thread goes to the first core in the list, and so on.
I build and run the STREAM benchmark like this:
wget www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -fopenmp -mcmodel=medium -O -DSTREAM_ARRAY_SIZE=1000000000 \
    -DNTIMES=1000 stream.c -o stream.1000M.1000
likwid-pin -c 0-39 ./stream.1000M.1000
The performance also drops without pinning, but then it's not clear which
thread is running on which core.
What could cause such drops, and how can I detect the cause?
- thermal throttling? (checked; everything seems OK)
- power capping? (is this implemented on Westmere-EX?)
- bad memory? (but why does it run well in the beginning?)
Here is also the output of Intel's Performance Counter Monitor (PCM) tool.
It looks like this in the beginning (output of /opt/PCM/pcm.x 2 -nc;
these values cover a 2-second interval):
EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock
ticks'/'invariant timer ticks'
(includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state
(not in power-saving C state)='unhalted clock
ticks'/'invariant timer ticks while in C0-state
(includes Intel Turbo Boost)
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax
temperature (thermal headroom):
0 corresponds to the max temperature
Core (SKT) | EXEC | IPC | FREQ | AFREQ | READ | WRITE | TEMP
----------------------------------------------------------------
SKT 0 0.13 0.26 0.50 1.00 27.20 11.52 N/A
SKT 1 0.13 0.26 0.50 1.00 27.14 11.51 N/A
SKT 2 0.13 0.27 0.50 1.00 27.13 11.51 N/A
SKT 3 0.13 0.27 0.50 1.00 27.12 11.51 N/A
----------------------------------------------------------------
TOTAL * 0.13 0.27 0.50 1.00 108.59 46.05 N/A
Instructions retired: 42 G ; Active cycles: 160 G ; Time (TSC): 4020 Mticks
C0 (active,non-halted) core residency: 49.95 %
C1 core residency: 49.64 %; C3 core residency: 0.03 %; C6 core
residency: 0.37 %;
C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7
package residency: 0.00 %;
PHYSICAL CORE IPC : 0.53 => corresponds to 13.26 %
utilization for cores in active state
Instructions per nominal CPU cycle: 0.26 => corresponds to 6.62 %
core utilization over time interval
--- And then it drops to these values ---
Core (SKT) | EXEC | IPC | FREQ | AFREQ | READ | WRITE | TEMP
----------------------------------------------------------------
SKT 0 0.06 0.12 0.50 1.00 13.82 5.28 N/A
SKT 1 0.06 0.25 0.23 1.00 12.68 4.97 N/A
SKT 2 0.06 0.25 0.23 1.00 12.67 4.97 N/A
SKT 3 0.06 0.25 0.23 1.00 12.67 4.96 N/A
----------------------------------------------------------------
TOTAL * 0.06 0.20 0.30 1.00 51.84 20.17 N/A
Instructions retired: 19 G ; Active cycles: 96 G ; Time (TSC):
4021 Mticks ;
C0 (active,non-halted) core residency: 29.84 %
C1 core residency: 53.53 %; C3 core residency: 0.00 %; C6 core
residency: 16.62 %;
C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7
package residency: 0.00 %;
PHYSICAL CORE IPC : 0.40 => corresponds to 9.91 %
utilization for cores in active state
Instructions per nominal CPU cycle: 0.12 => corresponds to 2.96 %
core utilization over time interval
----
These lines are interesting: SKT 0 has an IPC of 0.12 while SKT 1 has 0.25.
SKT 0: FREQ is 0.50, as expected, since half of the logical CPUs are idle
because of hyperthreading (40 threads on 80 logical CPUs).
SKT 1: FREQ is only 0.23, which means that even the "active" threads are
halted more than half of the time (0.23 instead of the expected 0.50). Why?
Core (SKT) | EXEC | IPC | FREQ | AFREQ | READ | WRITE | TEMP
SKT 0 0.06 0.12 0.50 1.00 13.82 5.28 N/A
SKT 1 0.06 0.25 0.23 1.00 12.68 4.97 N/A
Perf shows similar results: performance and bandwidth drop by about 50 %.
perf stat --per-socket --interval-print 2000 -a -e \
    "uncore_mbox_0/event=bbox_cmds_read/","uncore_mbox_1/event=bbox_cmds_write/" \
    sleep 3600
Help or suggestions would be much appreciated.
Thanks and best regards,
Andreas